bcmb8330-protein structure

advertisement
Computational Methods for
Protein Structure Prediction
Ying Xu
Outline

introduction to protein structures

the problem of protein structure prediction

why we can predict protein structure

protein tertiary structure prediction
– Ab initio folding
– homology modeling

protein threading
Protein and Structure
Protein
sequence
>1MBN:_ MYOGLOBIN (154 AA)
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI
PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
GYQG
Protein
structure
Protein
function
Oxygen storage
Ball and stick
spacefill
Protein Structure

protein sequence folds into a “unique” shape (“structure”) that
minimizes its free potential energy
Protein Structures

Primary sequence
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Secondary structure
-helix
anti-parallel
-sheet
parallel
Protein Structures

Tertiary structure

Quaternary structure
Protein Structures

Protein structure
– generally compact

Soluble protein structure
– individual domains are generally globular
– they share various common characteristics, e.g. hydrophobic
moment profile

Membrane protein structure
most of the amino acid sidechains of transmembrane
segments are non-polar
polar groups of the polypeptide backbone of transmembrane
segments generally participate in hydrogen bonds
Protein Tertiary Structures
Family: Clear evolutionary relationship, protein in the same
family are homologous, sequence identity >=30%.
Superfamily: Low sequence identity, probable common evolutionary
origin.
Fold: May not have a common evolutionary origin.
Major structural similarity.
Class: all-, all-, /, +, …
http://scop.mrc-lmb.cam.ac.uk/scop/
SCOP 1.65 release: 2327 families, 1294 superfamilies, 800 folds
SCOP: Structural Classification Of Proteins
Protein Tertiary Structures
Protein Structure Determination

X-ray crystallography
–
–
–
–

NMR
–
–
–
–

most accurate
in vitro
need crystals proteins
~50K per structure
accurate
in vivo
no need for crystals
limited to small proteins
Cryo-EM
– Imaging technology
– Low-resolution
Protein Structure Prediction
Problem:
Given the amino acid sequence of a protein,
what’s its 3-dimensional shape?
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI
PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
GYQG
?
……..
Why Protein Structure Prediction?
Importance of protein structure

–
–
–

knowledge of the structure of a protein enable us to understand
its function and functional mechanism
design better mutagenesis experiments
structure-based rational drug design
Experimental methods for protein structure determination
Pros: high resolution
Cons: time-consuming and very expensive
Why Protein Structure Prediction?

Big gap between the number of protein sequences and the
number of protein structures
– Uniprot/Swiss-prot, 283,454 protein sequences
– Uniprot/TrEMBL, 4,864,587 gene sequences
– PDB (Protein Data Bank), 48,385 protein structures

Fundamental, unsolved, challenging problem
Why We Can Predict Structure

In theory, a protein structure can solved
computationally

A protein folds into a 3D structure to minimizes its
free potential energy
– Anfinsen’s classic experiment on Ribonuclease A
folding in the 1960’s
– energy functions

This problem can be formulated as an optimization
problem
– protein folding problem, or ab initio folding
Why We Can Predict Structure

While there could be billions of billions of proteins in nature, the
number of unique structural folds (shapes) might be small
90% of new structures submitted to PDB in the past
three years have similar structural folds in PDB
Why We Can Predict Structure

Theoretical studies suggest that the vast majority of the proteins
in nature fall into not much more than 1000 structural folds

This realization has fundamentally changed how protein
structures can be predicted

The structure prediction problem becomes for a protein
sequence, find which of the structural folds the protein can fold
into, plus possibly some structural refinement
MTYKLILN …. NGVDGEWTYTE
Computational Methods for Protein
Structure Prediction
ab initio
--use first principles to fold proteins
--does not require templates
--high computational complexity
•homology modeling
--similar sequence similar structures
--practically very useful, need homologues
• protein threading
--many proteins share the same structural fold
--a folding problem becomes a fold recognition
problem
Need known
protein structures
ab initio structure prediction
An energy function to describe the protein
o
o
o
o
o
bond energy
bond angle energy
dihedral angel energy
van der Waals energy
electrostatic energy
Efficient and reliable algorithms to search the conformational
space to minimize the function and obtain the structure.
ab initio structure prediction

The problem is exceedingly difficult to solve
– the search space is defined by psi/phi angles of backbone and
side-chain positions
– the search space is enormous even for small proteins!
– the number of local minima increases exponentially of the number
of residues

Theoretically solvable but practically infeasible!
ROSETTA (Dave Baker’s Lab)

Construct a library of small structure fragments, e.g. 9 AA

Cut a target sequence to sequence fragments. For each
sequence fragment, choose structural candidate fragments from
the fragment library

Assemble the fragment structures by Monte Carlo simulation

The generated structures are grouped into clusters

Clusters are ranked by their energy
Homology Modeling
Observation: proteins with similar sequences tend to fold into
similar structures.
1. Target sequence is aligned with the sequence of a known structure,
they usually share sequence identity of 30% or higher
2. Superimpose target sequence onto the template,
replacing equivalent side-chain atoms where necessary
3. Refine the model by minimizing an energy function.
Programs:
Modeller
Swiss-Model
http://salilab.org/modeller/
http://swissmodel.expasy.org//SWISS-MODEL.html
Protein Threading

Basic premise
The number of unique structural (domain) folds in nature
is fairly small (possibly a few thousand)

Statistics from Protein Data Bank (~48,000 structures)
90% of new structures submitted to PDB in the past
three years have similar structural folds in PDB

Chances for a protein to have a native-like structural fold in
PDB are quite good (estimated to be 60-70%)
– Proteins with similar structural folds could be homologues or analogues
Protein Threading

The goal: find the “correct” sequence-structure alignment
between a target sequence and its native-like fold in PDB
MTYKLILN …. NGVDGEWTYTE

Energy function – knowledge (or statistics) based rather
than physics based
– Should be able to distinguish correct structural folds from
incorrect structural folds
– Should be able to distinguish correct sequence-fold alignment
from incorrect sequence-fold alignments
Protein Threading – four basic components

Structure database

Energy function

Sequence-structure alignment algorithm

Prediction reliability assessment
Protein Threading – structure database

Build a template database
Protein Threading – structure database
• Non-redundant representatives through structure-structure
and/or sequence-sequence comparison
FSSP (http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
(Families of Structurally Similar Proteins)
SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
PDB-Select (http://www.sander.embl-heidelberg.de/pdbsel/)
Pisces (http://www.fccc.edu/research/labs/dunbrack/pisces/)
Protein Threading – energy function
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
how preferable to put
two particular residues
nearby: E_p
how well a residue fits
a structural
environment: E_s
alignment gap
penalty: E_g
total energy: E_p + E_s + E_g
find a sequence-structure alignment
to minimize the energy function
Protein Threading – energy function

A singleton energy measures each residue’s preference in a
specific structural environments
– secondary structure
– solvent accessibility

Where
Compare actual occurrence against its “expected value” by
chance
Protein Threading – energy function


A simple definition of structural environment
– secondary structure: alpha-helix, beta-strand, loop
– solvent accessibility: 0, 10, 20, …, 100% of accessibility
– each combination of secondary structure and solvent
accessibility level defines a structural environment
• E.g., (alpha-helix, 30%), (loop, 80%), …
E_s: a scoring matrix of 30 structural environments by 20 amino
acids
– E.g., E_s ((loop, 30%), A)
Singleton energy term
Protein Threading – energy function
Helix
ALA
ARG
ASN
ASP
CYS
GLN
GLU
GLY
HIS
ILE
LEU
LYS
MET
PHE
PRO
SER
THR
TRP
TYR
VAL
Buried Inter
-0.578 -0.119
0.997 -0.507
0.819 0.090
1.050 0.172
-0.360 0.333
1.047 -0.294
0.670 -0.313
0.414 0.932
0.479 -0.223
-0.551 0.087
-0.744 -0.218
1.863 -0.045
-0.641 -0.183
-0.491 0.057
1.090 0.705
0.350 0.260
0.291 0.215
-0.379 -0.363
-0.111 -0.292
-0.374 0.236
Exposed
-0.160
-0.488
-0.007
-0.426
1.831
-0.939
-0.721
0.969
0.136
1.248
0.940
-0.865
0.779
1.364
0.236
-0.020
0.304
1.178
0.942
1.144
Sheet
Buried Inter
0.010 0.583
1.267 -0.345
0.844 0.221
1.145 0.322
-0.671 0.003
1.452 0.139
0.999 0.031
0.177 0.565
0.306 -0.343
-0.875 -0.182
-0.411 0.179
2.109 -0.017
-0.269 0.197
-0.649 -0.200
1.249 0.695
0.303 0.058
0.156 -0.382
-0.270 -0.477
-0.267 -0.691
-0.912 -0.334
Exposed
0.921
-0.580
0.046
0.061
1.216
-0.555
-0.494
0.989
-0.014
0.500
0.900
-0.901
0.658
0.776
0.145
-0.075
-0.584
0.682
0.292
0.089
Loop
Buried Inter
0.023 0.218
0.930 -0.005
0.030 -0.322
0.308 -0.224
-0.690 -0.225
1.326 0.486
0.845 0.248
-0.562 -0.299
0.019 -0.285
-0.166 0.384
-0.205 0.169
1.925 0.474
-0.228 0.113
-0.375 -0.001
-0.412 -0.491
-0.173 -0.210
-0.012 -0.103
-0.220 -0.099
-0.015 -0.176
-0.030 0.309
Exposed
0.368
-0.032
-0.487
-0.541
1.216
-0.244
-0.144
-0.601
0.051
1.336
1.217
-0.498
0.714
1.251
-0.641
-0.228
-0.125
1.267
0.946
0.998
Protein Threading – energy function

It measures the preference of a pair of amino acids to be
close in 3D space.

How close is close?
– distance dependent
– single cutoff
– C, C, or centroid of the sidechain

Observed occurrence of a pair compared with its
“expected” occurrence
Pair-wise interaction energy term
Protein Threading – energy function
ALA
ARG
ASN
ASP
CYS
GLN
GLU
GLY
HIS
ILE
LEU
LYS
MET
PHE
PRO
SER
THR
TRP
TYR
VAL
-140
268
105
217
330
27
122
11
58
-114
-182
123
-74
-65
174
169
58
51
53
-105
ALA
-18
-85
-616
67
-60
-564
-80
-263
110
263
310
304
62
-33
-80
60
-150
-132
171
ARG
-435
-417
106
-200
-136
-103
61
351
358
-201
314
201
-212
-223
-231
-18
53
298
ASN
17
278 -1923
67 191 -115
140 122
10
68
-267
88 -72 -31 -288
-454 190 272 -368
74 -448
318 154 243 294 179 294 -326
370 238
25 255 237 200 -160 -278
-564 246 -184 -667
95
54 194 178 122
211
50
32 141
13
-7 -12 -106 301 -494
284
34
72 235 114 158 -96 -195 -17 -272 -206
-28 105 -81 -102 -73 -65 369 218 -46
35 -21 -210
-299
7 -163 -212 -186 -133 206 272 -58 193 114 -162 -177
-203 372 -151 -211 -73 -239 109 225 -16 158 283 -98 -215 -210
104
52 -12 157 -69 -212 -18
81
29
-5
31 -432 129
95
268
62 -90 269
58
34 -163 -93 -312 -173
-5 -81 104 163
431 196 180 235 202 204 -232 -218 269 -50 -42
46 267
73
ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR
-20
-95
101
TRP
-6
107 -324
TYR VAL
Protein Threading – energy function
w(k) = h + gk, k ≥ 1, w(0) = 0;
Where h and g are constants.
h: opening gap penalty
g: extension gap penalty
FDSK---THRGHR
:.:
:: :::
FESYWTCTH-GHR
FDSK-T--HRGHR
:.: : : :::
FESYWTCTH-GHR
gap penalty term
Protein Threading – energy function
•
Secondary structure prediction is mature and can achieve ~80%
accuracy
•
The performance of using probabilities of the predicted three
secondary structure states (-helices, -strand, and loop) is
better
Secondary structure match energy
Threading Parameter Optimization

The contribution of each term (weight).

Based on threading performance on a training set (fold
recognition and alignment accuracy).

Different weight for different classes? (superfamily, fold)
pair-wise may contribute more for fold level threading
mutation/profile terms dominate in superfamily level threading
Etotal = sEsingleton + pEpairwise + gEgap + ssEss
Protein Threading -- algorithm

Considering only singleton energy + gap penalty

Represent a structure a sequence of “structural environments”
– (helix, 100%), (helix, 90%), ….. (strand, 0%)

Align a sequence MACKLPV …. with a structural sequence
(helix, 100%), (helix, 90%), ….. (strand, 0%)
Protein Threading – dynamic programming
AAGG
Two sequences: AACG and AAGG
| |
|
AACG
Step #1: calculating alignment matrix
A
A
C
G
A
2
2
-1
-1
A
2
4
3
G
-1
3
3
5
G
-1
2
2
5
2
Rule:
1: initialization– fill the
first row and column
with matching scores
2: fill an empty cell
based on scores of its
left, upper and upperleft neighbors + the
matching score of the
current cell
3: chose the one giving
the highest score
Protein Threading – dynamic programming
Step #2: Tracing back to recover the alignment
A
A
C
G
A
2
2
-1
-1
A
2
4
3
2
G
-1
3
3
5
G
-1
2
2
5
Rule:
1: start from the rightlower corner
2: trace back to left,
upper or upper-left
neighbor which gives
the current cell’s score
3. Keep doing this until
it cannot continue
Protein Threading – dynamic programming
Steps:
1. Initialization: construct an (n+1) x (m+1) matrix F for two sequences
of lengths n and m.
2. Matrix fill: for each cell in the matrix F, check all possible pathways
back to the beginning of the sequence (allowing insertions and
deletions) and give that cell the value of the maximum scoring.
3. Traceback: construct an alignment back from the last cell in the
matrix (or the highest scoring) cell to give the highest scoring
alignment.
Protein Threading -- dynamic programming
(helix,
100%)
M
L
V
A
(helix,
90%)
(helix,
80%)
(loop,
80%)
Protein Threading -- algorithm

Considering all three energy terms

Considering the pair-wise interaction energy makes
the problem much more difficult to solve
– dynamic programming algorithm does not work any more!

There are other techniques that can be used to solve
the problem
Protein Threading -- algorithm


Dynamic programming
Heuristic algorithms for pair-wise interactions
– Frozen approximation algorithm (A. Godzik et al.)
– Double dynamic programming (D. Jones et al.)
– Monte carlo sampling (S.H. Bryant et al.)

Rigorous algorithms for pair-wise interactions
–
–
–
–

Branch-and-bound (R.H. Lathrop and T.F. Smith)
Divide-and-conquer (Y. Xu et al.) --PROSPECT
Linear programming (J. Xu et al.) –RAPTOR
Tree decomposition (L. Cai et al.)
Rigorous algorithm for treating backbone and side-chain
simultaneously (Li et al.)
Fold Recognition
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Score = -1500
Score = -720
Score = -1120
Score = -900
Which one is the correct structural
fold for the target sequence if any?
The one with the highest score ?
Fold Recognition
Query sequence: AAAA
Template #1: AATTAATACATTAATATAATAAAATTACTGA
Better template?
Template #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA
Which of these two sequences will have better
chance to have a good match with the query
sequence after randomly reshuffling them?
Fold Recognition

Different template structures may have different background
scores, making direct comparison of threading scores against
different templates invalid

Comparison of threading results should be made based on how
standout the score is in its background score distribution rather
than the threading scores directly
Fold Recognition
Threading 100,000
sequences against a
template structure provides
the baseline information
about the background
scores of the template
By locating where the
threading score with a
particular query sequence,
one can decide how
significant the score, and
hence the threading result,
is!
Not significant
significant
Fold Recognition
Z-score =
score - average
standard deviation
--randomly shuffle the query sequence and calculate the alignment score
Fold Recognition

Significance score versus prediction specificity/sensitivity
Z-score
Fold Recognition

Examine feature space of threading alignments:
(singleton score, pair contact scores, secondary structure score,
hydrophobic moment score, ......) versus true/false fold recognition
-2000, -500, -35, -90, ......, true
-1000, -201, -11, -500, ......, false
false
-5020, -900, -20, -75, ......, true
-1050, -185, -18, -320, ......, false
true
......

Separate true ones from false ones using support vector
machine (SVM)
Fold Recognition

Each feature has somewhat different distributions in the true and
false predictions
 E.g., hydrophobic moments (Hydrophobic moments of protein structures: spatially profiling the
distribution, David Silverman, PNAS 2001 98: 4996-5001) is quite useful in distinguishing
true from false threading predictions
Hydrophobic Moment profiles
120
100
80
60
H2(d)
40
20
0
-20
0
5
10
15
20
-40
-60
-80
-100
d (Angstromes)
25
30
35
Application
Protein sequence
Sequence
preprocessing
Remove signal peptide
Membrane/soluble??
Domain prediction
Sequence profile generation
Secondary structure Prediction
Experimental data constraints
NO
Database
searching
Find homolog
in PDB?
Fold
recognition
YES
Homology
modeling
3D
structure
Challenging Issues

Much improved energy functions for threading

Considering side-chain information when doing
protein threading

Effective integration of fragment-based methods and
threading techniques
Download