Secondary Structure Prediction Using Decision Lists

Deniz YURET
Volkan KURT
Outline
• What is the problem?
• What are the different approaches?
• How do we use decision lists and why?
• Why does evolution help?
What is the problem?
• The generic prediction algorithm
• Some important pitfalls: definition, data set
• Upper and lower bounds on performance
• Evolution and homology enter the picture
Tertiary / Quaternary Structure
Secondary Structure
MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD
----------HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm
• Sequence to Structure
• Structure to Structure
A Generic Prediction Algorithm:
Sequence to Structure
MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD
??????????????????????????????????????
(one residue is predicted at a time, left to right)
MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD
----H-----HHHHHHHHHH------EEEEE-------
A Generic Prediction Algorithm:
Structure to Structure
MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD
----H-----HHHHHHHHHH------EEEEE-------
(each prediction is re-examined in its structural context; the spurious H at position 5 is corrected)
MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD
----------HHHHHHHHHH------EEEEE-------
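The second pass can be sketched as a simple smoothing filter. This is a toy stand-in for illustration, not any published method: it just relabels isolated H/E predictions as loop.

```python
# Toy structure-to-structure filter (an illustrative stand-in, not a
# published method): an H or E with loops (or chain ends) on both
# sides is unlikely to be real, so it is relabeled as loop.

def structure_to_structure(structure):
    out = []
    n = len(structure)
    for i, s in enumerate(structure):
        left = structure[i - 1] if i > 0 else "-"
        right = structure[i + 1] if i < n - 1 else "-"
        out.append("-" if s != "-" and left == "-" and right == "-" else s)
    return "".join(out)

first_pass = "----H-----HHHHHHHHHH------EEEEE-------"
print(structure_to_structure(first_pass))
# -> ----------HHHHHHHHHH------EEEEE-------  (the lone H is removed)
```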
Pitfalls for newcomers
• Definition of secondary structure
• Choice of data set
Pitfall 1:
Definition of Secondary Structure
• DSSP: H, B, E, G, I, T, S
• STRIDE: H, G, I, E, B, b, T, C
• DEFINE: ???
• Convert all to H, -, and E
• They only agree 71% of the time! (95% for DSSP and STRIDE)
• Solution: Use DSSP
Pitfall 2: Dataset
• Trivial to get 80%+ when homologies are
present between the training and the test set
• Homology identification keeps evolving
• RS126, CB513, etc.
• Comparison of programs on different data sets is meaningless…
Performance Bounds
• Simple baselines for lower bound
• A method for estimating an upper bound
Performance Bounds
• Baseline 1: 43% of all residues are tagged “loop”
43%: assign loop
Performance Bounds
• Baseline 2: 49% of all residues are tagged with the most frequent structure for the given amino acid.
49%: assign most frequent
43%: assign loop
Performance Bounds
• Upper bound: Only consider exact matches for a given frame size.
• As the frame size increases, accuracy should increase but coverage should fall.
100% ???
49%: assign most frequent
43%: assign loop
Upper Bound with Homologs
[Chart: bound and coverage vs. frame size 1–9; the bound rises from roughly 49% to 96% while coverage falls from 100% to about 12%]
Upper Bound without Homologs
[Chart: bound and coverage vs. frame size 1–9; the bound rises from roughly 48% to 75% while coverage collapses from 100% to below 1%]
Performance Bounds
• Upper bound: Only consider exact matches for a given frame size.
• As the frame size increases, accuracy should increase but coverage should fall.
100% ???
75%: estimated upper bound
49%: assign most frequent
43%: assign loop
The Miracle of Homology
• People used to be stuck at around 60%.
• Rost and Sander crossed the 70% barrier in
1993 using homology information.
• All algorithms benefit 5-10% from
homology.
• The homologues are of unknown structure, so the training and test sets remain unrelated!
• Why?
The Miracle of Homology
[Figure: accuracy stuck around 60% without homology information]
The Miracle of Homology
[Figure: accuracy around 70% with homology information]
Outline
• What is the problem?
• What are the different approaches?
• How do we use decision lists and why?
• Why does evolution help?
GORV
[Flow: Sequence → Information Function / Bayesian Statistics → Secondary Structure (66.9%); PSI-BLAST homologs add +6.5%; Majority Vote Filter → Secondary Structure (73.4%)]
* Garnier et al., 2002
PHD
[Flow: HSSP frequency profile (homologs: +4.3%) → Neural Network (sequence-to-structure, 61.7% / 65.9%) → Neural Network (structure-to-structure, 62.6% / 67.4%) → Jury + Filter (+3.4%) → Secondary Structure (70.8%)]
* Rost & Sander, 1993
JNet
[Flow: Profiles from PSI-BLAST, HMMER2, and CLUSTALW → Neural Networks → Jury Network + Filter → Secondary Structure (76.9%)]
* Cuff & Barton, 2000
PSIPRED
[Flow: PSI-BLAST profiles → Neural Network (sequence-to-structure) → Neural Network (structure-to-structure) → Secondary Structure (76.3%)]
* Jones, 1999
Outline
• What is the problem?
• What are the different approaches?
• How do we use decision lists and why?
• Why does evolution help?
Introduction to Decision Lists
• Prototypical machine learning problem:
– Decide democrat or republican for 435
representatives based on 16 votes.
Class Name: 2 (democrat, republican)
1. handicapped-infants: 2 (y,n)
2. water-project-cost-sharing: 2 (y,n)
3. adoption-of-the-budget-resolution: 2 (y,n)
4. physician-fee-freeze: 2 (y,n)
5. el-salvador-aid: 2 (y,n)
6. religious-groups-in-schools: 2 (y,n)
…
16. export-administration-act-south-africa: 2 (y,n)
Introduction to Decision Lists
• Prototypical machine learning problem:
– Decide democrat or republican for 435
representatives based on 16 votes.
1. If adoption-of-the-budget-resolution = y
and anti-satellite-test-ban = n
and water-project-cost-sharing = y
then democrat
2. If physician-fee-freeze = y
then republican
3. If TRUE then democrat
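Applying such a list is first-match semantics: scan the rules in order and return the class of the first rule whose conditions all hold. A minimal sketch (the rule encoding below is mine, not from the slides):

```python
# A decision list as an ordered sequence of (conditions, label) pairs.
# The first rule whose conditions all match decides the class; the
# final "If TRUE" rule guarantees a default answer.

RULES = [
    ({"adoption-of-the-budget-resolution": "y",
      "anti-satellite-test-ban": "n",
      "water-project-cost-sharing": "y"}, "democrat"),
    ({"physician-fee-freeze": "y"}, "republican"),
    ({}, "democrat"),  # the "If TRUE then democrat" default rule
]

def classify(votes, rules=RULES):
    for conditions, label in rules:
        if all(votes.get(k) == v for k, v in conditions.items()):
            return label
    raise ValueError("decision list must end with a default rule")

print(classify({"physician-fee-freeze": "y"}))  # -> republican
```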
The Greedy Prepend Algorithm
Rule Search
• Initially everything is predicted to be the most frequently seen structure (i.e. loop)
[Diagram: the base rule partitions the training set into correct (+) and false (-) assignments]
Rule Search
• At each step, add the maximum-gain rule
[Diagram: the second rule further partitions the training set on top of the base rule's partition]
GPA Rules
• The first three rules of the sequence-to-structure decision list
– 58.86% performance (of 66.36%)
GPA Rule 1
• Everything => Loop
GPA Rule 2
[Rule table: predict HELIX over a 9-residue window (L4…R4); most positions exclude GLY and PRO (plus ASN and SER at some positions), and the marked position must be non-polar or large]
GPA Rule 3
[Rule table: predict STRAND over a 9-residue window (L4…R4); positions exclude residues such as LEU, ALA, ASP, ARG, GLN, MET, GLY, GLU, LYS, and PRO, and the center position must be one of CYS, ILE, LEU, PHE, TRP, TYR, VAL (non-polar and not charged)]
GPA
[Flow: Sequence → GPA (sequence-to-structure, 60.48%) → GPA (structure-to-structure) → Secondary Structure (62.54%, or 69.21% with PSI-BLAST homologs: +6.67%)]
Experimental Setup
• DSSP assignments
• Reduction:
– E (extended strand), B (β bridge) -> Strand
– H (α helix), G (3-10 helix) -> Helix
– Others -> Loop
• Data set:
– CB513 set
– 7-fold cross-validation
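The reduction above is a straightforward lookup table; a minimal sketch (treating every unlisted DSSP code as loop):

```python
# The 8-to-3 state reduction as a lookup table over DSSP codes;
# anything not listed (T, S, I, blank, ...) falls through to loop.

REDUCE = {
    "H": "H", "G": "H",   # alpha helix, 3-10 helix -> Helix
    "E": "E", "B": "E",   # extended strand, beta bridge -> Strand
}

def reduce_states(dssp_string):
    return "".join(REDUCE.get(s, "-") for s in dssp_string)

print(reduce_states("HHGGTTEEBSC"))  # -> HHHH--EEE--
```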
GPA Performance
• Performance of seq-to-struct decision list:
– Without homologs: 60.48% (29 to 66 rules)
– With homologs: 66.36% (46 to 68 rules)
• Performance with struct-to-struct filter:
– Without homologs: 62.54% (18 to 116 rules)
– With homologs: 69.21% (16 to 40 rules)
GPA Performance
• Performance at 20 rules at both steps:
– Without homologs: 62.15%
– With homologs: 69.08%
• Possible to make a back-of-the-envelope
structure prediction using our model
Comparison on CB513
• PHD      72.3
• NNSSP    71.7
• GPA      69.2
• DSC      69.1
• Predator 69.0
Outline
• What is the problem?
• What are the different approaches?
• How do we use decision lists and why?
• Why does evolution help?
The Miracle of Homology
[Figure: accuracy around 70% with homology information]
Discussion
• Training set homologues and test set
homologues help for different reasons.
• Training set homologues use semi-accurate
guesses of structure to provide information
on amino-acid substitutions
• Test set homologues take advantage of
“independent errors” in prediction
• The less similar the homologue sequences
the better…
Summary
• Homologues between the training set and
the test set unfairly influence results.
• Homologues within the training set and the
test set still help significantly.
• There is an upper bound at around 75%
unless we use a homologue of the target
protein.
• Very different learning algorithms converge
on comparable accuracy.
Some Educated Guesses
• Significant progress probably requires
better homology detection rather than better
learning algorithms.
• To exceed the 75% bound one needs to start
incorporating long range interactions.
• CASP shows predicting tertiary structure first gives comparable results – any use for secondary structure?
Thank you…
• The algorithm, the paper, etc. available
from:
dyuret@ku.edu.tr
Introduction
• Protein Structure
– What is Secondary Structure?
– What is Tertiary Structure?
• Secondary Structure Prediction
– What are decision lists?
– GPA in Action
• Tertiary Structure Prediction
Protein Structure
• Primary Structure
– Sequences
• Secondary Structure
– Frequent Motifs
• Tertiary Structure
– Functional Form
• Quaternary Structure
– Protein complexes
Primary Structure
• Sequence information
• Contains only amino acid sequences
– 24 amino acid codes present
– 20 standard residues
– Glutamine or Glutamic Acid → GLX (GLU)
– Asparagine or Aspartic Acid → ASX (ASN)
– Others (Non-natural/Unknown) → X
• Selenocysteine, Pyrrolysine
Secondary Structure
• Rigid structure motifs
• Do not give information about coordinates
of residues
• Can be seen as a one-dimensional reduction
of the tertiary structure
• If accurately predicted, can be used to
– Predict the final (tertiary) structure
– Predict the fold type (all-alpha/all-beta etc.)
Common Secondary Structure
Motifs
[Figure: α-helix, parallel β-sheet, antiparallel β-sheet]
Tertiary/Quaternary Structure
• Tertiary Structure
– The functional form
– Coordinates of residues in the space
• Quaternary Structure
– Protein – Protein complexes
– Assembly of one or more proteins
Structure Prediction
Sequences vs. Structures
[Chart: number of database entries, 1986–2004; Swissprot sequences vastly outnumber PDB structures]
• Easier to determine sequence than structure
• Predictions may help close the gap
Secondary Structure Prediction
• Assessment of Prediction Accuracy
• Common Strategy
• Methods in Literature
• Decision Lists
– Prediction using GPA
• A Performance Bound
Secondary Structure Prediction
• Predictions based on
– Sequence Information
– Multiple Sequence Alignments
• Various algorithms exist based on
– Information Theory
– Machine Learning
– Neural Networks etc.
Assessment of Accuracy
• Determination method
– DSSP
• Performance Metric
– Q3 accuracy
– Three state accuracy (helix/strand/loop)
• Data set selection
– Non-redundancy
• Homology Information
– Multiple Sequence Alignments
• Cross-Validation
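Q3 itself is simple to compute: the percentage of residues whose predicted three-state label (H/E/-) matches the reference assignment. A minimal sketch:

```python
# Q3 (three-state per-residue accuracy): fraction of positions where
# the predicted label equals the reference label, as a percentage.

def q3(predicted, actual):
    assert len(predicted) == len(actual)
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

print(q3("----H-----HHHHHHHHHH------EEEEE-------",
         "----------HHHHHHHHHH------EEEEE-------"))
# one mismatch in 38 residues -> about 97.37
```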
Two Levels of Prediction
• First Level: Sequence to Structure
• Input: Sequence Information, Multiple Sequence Alignments (MSA)
• Method: Machine Learning, Neural Networks
• Output: Secondary Structure
[Flow: Sequence + MSA → Sequence to Structure → Secondary Structure]
Two Levels of Prediction
• Second Level: Structure to Structure
• Input: Structure Information
• Method: Machine Learning, Neural Networks, Filter (simple filters, jury decisions)
• Output: Secondary Structure
[Flow: Secondary Structure → Structure to Structure → Filter → Secondary Structure]
Decision Lists
• Machine Learning method
• Simply, a list of rules
• Each rule asserts a guess
• Generalization by simple rule pruning
• Output is human readable/understandable
GPA
• Greedy decision list learner
• Start with a global (base) rule
• At every step:
– Find the maximum-gain rule
– Prepend it to the previous list (hence “Greedy Prepend”)
• Stop when the gain is 0
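The steps above can be sketched as a tiny GPA learner over generic feature dictionaries. This is a sketch with a deliberately small rule space (single feature = value tests); the real GPA searches richer conjunctions over frame positions:

```python
from collections import Counter

# Sketch of the Greedy Prepend Algorithm: start from a default rule,
# repeatedly prepend the candidate rule that most reduces training
# error, stop when no candidate has positive gain.

def matches(rule, x):
    cond, _ = rule
    return all(x.get(k) == v for k, v in cond.items())

def predict(rules, x):
    for rule in rules:            # first-match semantics
        if matches(rule, x):
            return rule[1]

def errors(rules, data):
    return sum(predict(rules, x) != y for x, y in data)

def gpa(data):
    # Base rule: predict the most frequent class for everything.
    default = Counter(y for _, y in data).most_common(1)[0][0]
    rules = [({}, default)]
    labels = set(y for _, y in data)
    while True:
        base = errors(rules, data)
        best, best_gain = None, 0
        # Toy candidate space: every single feature=value test.
        for x, _ in data:
            for k, v in x.items():
                for y in labels:
                    cand = ({k: v}, y)
                    gain = base - errors([cand] + rules, data)
                    if gain > best_gain:
                        best, best_gain = cand, gain
        if best is None:          # zero gain: stop
            return rules
        rules = [best] + rules    # prepend the maximum-gain rule
```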
Data Representation
• Frames of length W
– The context of an amino acid is represented by W residues
– (W-1)/2 to the left, (W-1)/2 to the right
– If the frame exceeds the termini, missing positions are represented as NAN
– GLX = GLN, ASX = ASN
– Newly found / non-natural amino acids = X
Sample Data
• evealekkv[aaLes]vqalekkvealehg → helix
• Frame Size = 5
• Represents the features used in the
prediction of secondary structure for L
(leucine)
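The framing scheme can be sketched as a sliding window with NAN padding at the termini (the helper names below are mine):

```python
# Sliding-window framing: each residue is represented by a window of
# W residues centered on it, with positions past the termini filled
# by "NAN" and ambiguity codes normalized per the slides' convention.

NORMALIZE = {"GLX": "GLN", "ASX": "ASN"}

def frames(sequence, W=5):
    half = (W - 1) // 2
    padded = (["NAN"] * half
              + [NORMALIZE.get(r, r) for r in sequence]
              + ["NAN"] * half)
    return [padded[i:i + W] for i in range(len(sequence))]

seq = ["GLU", "VAL", "GLX", "ALA", "LEU"]
print(frames(seq, W=5)[2])
# -> ['GLU', 'VAL', 'GLN', 'ALA', 'LEU']  (GLX normalized to GLN)
```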
2-level Algorithm
• Sequence to Structure List
– Find the first rule that matches the data point
– Assign the output of that rule
– A frame of 9 residues is input
– Output: Secondary Structure
• Structure to Structure List
– After all predictions are made, check for possible improvements
– A frame of 19 secondary structures is input
– Output: Secondary Structure
GPA/PHD/GORV

         Level 1       Level 1     Level 2       Level 2     Final
         (no homolog)  (homolog)   (no homolog)  (homolog)
GPA      60.48%        66.36%      62.54%        69.21%      69.21%
PHD      61.7%         65.9%       62.6%         67.4%       70.8%
GORV     N/A           N/A         66.9%         71.8%       73.4%
Discussion - Why GPA?
• Amazingly simple models
– With as low as 20 rules in the first level and as
low as 20 rules in the second
• Rules (Models) are human-readable
– Biological rules may be inferred
• Second level decision list may be used as a
filter for other algorithms
A Performance Bound Claim
• Using only sequence information, the highest achievable performance has an upper bound
• The lower bound:
– 43%, with everything assigned as loop
– 49%, with every residue assigned the most probable structure
• The upper bound:
– 75%, with non-homologous data
A Performance Bound Claim
• The bound is calculated by:
– Taking only the exact sequence matches between the training and testing sets
– Assigning the most frequently seen value of that frame in the training set as the guess
– Comparing with the actual value
• A bound for non-homologous training and testing sets
• A bound for a carefully selected frame size
– Not too short (assignments would be almost random)
– Not too long (only unique frames will be available)
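A sketch of that bound estimate (function and variable names are mine; train and test are lists of (sequence, structure) string pairs):

```python
from collections import Counter, defaultdict

# Upper-bound estimate: for each test frame, predict only when the
# exact frame occurred in training, using the most frequent center
# structure seen there; report (accuracy on covered, coverage).

def upper_bound(train, test, W):
    table = defaultdict(Counter)
    for seq, struct in train:
        for i in range(len(seq) - W + 1):
            table[seq[i:i + W]][struct[i + W // 2]] += 1
    covered = correct = total = 0
    for seq, struct in test:
        for i in range(len(seq) - W + 1):
            total += 1
            frame = seq[i:i + W]
            if frame in table:
                covered += 1
                guess = table[frame].most_common(1)[0][0]
                correct += guess == struct[i + W // 2]
    coverage = covered / total if total else 0.0
    accuracy = correct / covered if covered else 0.0
    return accuracy, coverage
```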
Upper Bound with Homologs
[Chart: bound and coverage vs. frame size 1–9; the bound rises from roughly 49% to 96% while coverage falls from 100% to about 12%]
Upper Bound without Homologs
[Chart: bound and coverage vs. frame size 1–9; the bound rises from roughly 48% to 75% while coverage collapses from 100% to below 1%]
Tertiary Structure Prediction
• Predictions based on backbone dihedral angles
– Phi and Psi angles fully define the tertiary structure
• Goal:
– Discover the right level of granularity
Data Set Selection
• PDB-Select
– A set of non-homologous proteins of high
resolution [Hobohm & Sander, 1994]
• Data representation
– Frames of 9 residues
– Residue names plus residue properties
• Hydrophobicity, polarity, volume, charge etc.
• Train/Validation/Test
Data Discretization
• Phi/Psi angles are continuous
– We need a discrete representation to predict
them in a decision list
• Split the (-180, 180) region into bins
• Split the Ramachandran plot into bins
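A minimal sketch of that discretization (the bin width is a free parameter; 30° gives 12 bins):

```python
# Discretize a dihedral angle in (-180, 180] into equal-width bins.
# Bin width is a parameter; width=30 yields bin indices 0..11.

def bin_angle(angle, width=30):
    return int((angle + 180) // width) % (360 // width)

print(bin_angle(-179))  # -> 0   (first bin)
print(bin_angle(0))     # -> 6   (with width=30)
print(bin_angle(179))   # -> 11  (last bin)
```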
Ramachandran Plot (1)
Ramachandran Plot (2)
* Karplus, 1996
How to Predict?
• Predictions using sequence information
– No homology information
• Predicted angles may be incorporated
– Upper bounds will be given
• Accuracy
– Percent of correct estimates
– RMSD of phi and psi angles
Using Predicted Angles
Performance: Accuracy

                      Phi                                 Psi                             Combined
                  15     30     60     90     120     15     30     60     90     120    Region  Secondary
All    All       37.27  51.22  64.87  77.15  80.51  32.79  52.06  68.81  76.04  80.44   58.75   71.82
       Same      31.90  44.52  61.44  68.44  78.26  30.47  49.05  64.97  71.99  77.42   58.04   73.05
       Identical 31.56  43.89  61.37  66.81  78.12  29.84  48.39  64.72  72.00  76.64   56.40   71.40
       None      29.40  42.58  61.23  59.92  78.11  22.83  35.69  49.85  53.90  60.77   39.38   53.18
Right  All       36.64  50.36  63.57  76.84  79.51  31.08  48.29  65.22  71.74  75.58   57.20   69.60
       Same      31.64  43.81  61.66  67.24  78.19  29.89  47.41  63.74  70.16  74.81   56.47   69.51
       Identical 31.49  43.82  61.39  66.44  78.10  29.47  47.35  63.57  69.56  74.56   56.60   69.42
Left   All       32.33  46.78  62.27  68.14  79.05  31.60  49.87  67.67  75.57  79.78   54.96   69.21
       Same      31.23  44.29  61.48  66.72  78.16  28.47  45.96  63.34  70.83  76.26   54.60   68.66
       Identical 31.03  43.32  61.32  66.61  78.12  28.64  45.80  63.35  70.22  74.94   55.07   68.59
Performance: RMSD

                      Phi                                 Psi                             Combined
                  15     30     60     90     120     15     30     60     90     120    Region  Secondary
All    All       42.95  44.37  46.67  51.13  52.98  66.38  66.01  71.23  70.84  53.07   56.93   48.11
       Same      47.68  46.90  50.95  57.37  55.95  71.30  69.56  75.38  75.21  57.02   57.56   47.09
       Identical 47.70  47.42  51.19  58.07  56.13  72.25  70.15  75.80  75.19  57.99   60.64   48.60
       None      49.24  48.45  51.60  61.11  56.15  99.69  101.12 103.49 106.83 75.16   79.98   65.34
Right  All       43.41  44.95  47.71  51.49  54.31  73.97  72.02  75.54  75.64  59.30   59.31   50.37
       Same      48.17  47.06  50.24  58.33  56.04  74.24  74.37  78.21  78.64  60.23   58.96   50.50
       Identical 48.26  47.72  51.08  58.73  56.16  74.13  73.57  78.58  80.03  60.52   58.94   50.61
Left   All       45.41  45.48  49.19  56.92  54.92  69.45  67.86  73.14  71.45  53.96   59.59   50.62
       Same      47.27  47.06  50.80  58.94  56.08  73.92  72.82  77.88  77.72  58.47   60.26   51.24
       Identical 47.37  48.56  51.39  59.01  56.13  73.15  72.47  78.00  78.72  60.07   59.73   51.32
Performance: Backbone RMSD
Performance: Input Features
Performance: Real Prediction
Future Work
• For tertiary structure predictions:
– The two-leveled approach may be applied to tertiary structure predictions
– Homology information may be incorporated
• For secondary structure predictions:
– Better homologues and better representations should be found
– Incorporating sequence and homology information in the structure-to-structure part may be an option
• For both predictions:
– A reliability index for predicted structure