What parsing algorithms
can tell us about
protein folding
Julia Hockenmaier
Computer Science, UIUC
http://www.cs.uiuc.edu/~juliahmr
juliahmr@cs.uiuc.edu
Two unrelated facts
• People understand language.
• Proteins fold into unique
3D structures.
Natural language understanding
and protein folding are both
really hard for computers
The need for protein folding
and structure prediction
Designing new drugs
(which bind to proteins)
Understanding
misfolding diseases
(Alzheimer’s, etc.)
The need for natural
language understanding
Information extraction
(news, scientific papers)
Machine translation
[Figure: foreign-language text sample]
Dialog systems
(phone, robots)
Parsing: a necessary first step
[Figure: foreign-language text sample]
• What are these symbols?
• How do they fit together?
I eat sushi with tuna.
I eat sushi with chopsticks.
• Language is ambiguous.
• What is the most likely structure for a given sentence?
Dependency graphs
describe sentence structures
I eat sushi with tuna.
I eat sushi with chopsticks.
Proteins are amino acid sequences
[Figure: backbone of two linked amino acids (N, C, C′ with their H and O atoms) and side chains R]
Amino Acids
The 20 amino acids differ only in their side chains (hydrophobic or polar).
Proteins fold into a unique
lowest-energy structure
(native state)
Folded structures are stabilized by
side chain contacts.
Contact graphs describe protein structures
α-Helix: fast folding
β-Sheet: slow folding
The Levinthal paradox:
Folding is a search problem
(C. Levinthal 1968)
A protein with 150 amino acids has 10^300 possible structures.
How can it find its native state?
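A rough way to see where a number of this size comes from (a back-of-the-envelope count, not necessarily Levinthal's original argument): if each of the 150 amino acids can independently take on the order of 100 local conformations, the chain has about

100^150 = (10^2)^150 = 10^300

possible structures. Searching them one at a time would take astronomically longer than the age of the universe, yet real proteins fold in milliseconds to seconds.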
Two unrelated search problems
• Natural language parsing:
Find the grammatical structure
of a sentence.
• Protein folding:
Find the folded structure
of a protein chain.
Two similar search problems
Find the optimal structure of a sequence.
• The structure is determined by the sequence.
• The number of possible structures grows exponentially with sequence length.
Solving both search problems
Structural Representation
Scoring Function
Search Algorithm
Natural Language
Parsing
Solving the parsing
problem
Search
Algorithm
Structural
Representation
Scoring
Function
Grammars for natural
language parsing
• A grammar is a description of the
syntax of a particular language.
• There are many different
grammar formalisms
(programming languages for
grammars)
Context-free grammar
S → NP VP
VP → V NP
VP → VP PP
NP → NP PP
PP → P NP
NP → we
NP → sushi
V → eat
P → with
[Figure: two parse trees built with these rules — "eat sushi with tuna", with the PP "with tuna" attached to the NP "sushi", and "eat sushi with chopsticks", with the PP attached to the VP]
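As a minimal sketch, the toy grammar above can be stored as plain Python dictionaries; the dictionary encoding and the extra lexical entries for "tuna" and "chopsticks" (used in the example trees, but not listed as rules on the slide) are my own assumptions:

```python
# A toy encoding of the context-free grammar above.
# Binary rules A -> B C are indexed by their right-hand side,
# lexical rules A -> word by the word.

binary_rules = {          # (B, C) -> set of possible parents A
    ("NP", "VP"): {"S"},
    ("V",  "NP"): {"VP"},
    ("VP", "PP"): {"VP"},
    ("NP", "PP"): {"NP"},
    ("P",  "NP"): {"PP"},
}

lexicon = {               # word -> set of categories
    "we": {"NP"}, "sushi": {"NP"}, "eat": {"V"}, "with": {"P"},
    "tuna": {"NP"}, "chopsticks": {"NP"},   # used in the example trees
}
```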
Statistical parsing
• We want the most likely parse of a sentence:
argmax_τ P(τ|s) = argmax_τ P(τ,s)/P(s) = argmax_τ P(τ,s)
(P(s) does not depend on τ.)
• We use machine learning to estimate P(τ,s).
The CKY parsing algorithm
(Younger ’67, Kasami ‘65)
S → NP VP
VP → V NP
V → eat
NP → we
NP → sushi
We eat sushi
[Chart, filled bottom-up: NP for "We", V for "eat", NP for "sushi"; then VP for "eat sushi"; then S for "We eat sushi" in the top cell]
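A minimal sketch of this chart-filling procedure for a grammar in Chomsky normal form, using the dictionary encoding from the grammar sketch above; this is an illustration of CKY, not the parser used in the talk:

```python
from itertools import product

def cky(words, binary_rules, lexicon):
    """Fill the CKY chart bottom-up; chart[i][j] holds the set of
    categories that span words[i:j]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]

    # Width-1 spans: look each word up in the lexicon.
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(w, ()))

    # Wider spans: combine two adjacent smaller spans with a binary rule.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):               # split point
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary_rules.get((b, c), set())
    return chart

# Example (with the dictionaries from the grammar sketch above):
# cky(["we", "eat", "sushi"], binary_rules, lexicon)[0][3]  ->  {"S"}
```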
Protein Folding
Solving the folding
problem
Search
Algorithm
Structural
Representation
Scoring
Function
Real proteins are
difficult to simulate
Blue Gene’s biggest success: a protein with 21 amino acids.
Folding@home’s biggest success: a protein with 36 amino acids.
We need a simple model system that captures the essential properties of proteins.
Protein chains are
self-avoiding walks (SAWs)
• Proteins are connected chains.
Sequence-adjacent amino acids
are also physically adjacent.
• Proteins occupy space.
The space occupied by one amino acid
can’t be occupied by another one.
The HP model: The simplest protein model
• Two kinds of amino acids: hydrophobic (H) and polar (P).
• Proteins are SAWs on a 2D square lattice. Sequence-adjacent amino acids are up, down, left or right of each other.
[Figure: an HP chain folded on the square lattice, with H and P residues marked]
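A minimal sketch of this representation, assuming residues are stored as (x, y) lattice coordinates (my encoding, not the talk's): a conformation is a list of positions, one per residue, and a legal conformation must be a self-avoiding walk.

```python
def is_self_avoiding_walk(positions):
    """Check that a 2D lattice conformation is a valid SAW:
    consecutive residues sit on neighbouring lattice sites,
    and no two residues occupy the same site."""
    # Connectivity: each step moves exactly one unit up/down/left/right.
    for (x1, y1), (x2, y2) in zip(positions, positions[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            return False
    # Excluded volume: no site is used twice.
    return len(set(positions)) == len(positions)

# A 4-residue chain folded into an L-shape on the square lattice:
# is_self_avoiding_walk([(0, 0), (1, 0), (1, 1), (0, 1)])  ->  True
```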
Folding = energy minimization
• Every structure has an energy F.
• Physical systems want to be in
the state with the lowest energy.
• Proteins have a unique lowest-
energy state, their “native state”.
• This is why they fold.
The energy landscape
is funnel-shaped
Folding = downhill moves in the landscape
(Fig.: Dill and Chan’97)
Folding is driven by
the hydrophobic effect
Folded proteins have a
hydrophobic core:
- Proteins are surrounded by water.
- Minimizing contact between the hydrophobic side chains and the water is energetically favorable.
- Hydrophobic contacts are favorable.
Where do we get the
energy function from?
• Physics-based:
(molecular dynamics, etc.)
• Statistics-based:
learn it from databases of
known protein structures.
• Simplified models:
we need to define it ourselves.
The energy function in the HP model
• Contact energies: every HH contact contributes -1.
• We only consider HP sequences with a unique native state.
• Folding is still NP-hard.
[Figure: an HP chain on the lattice with its HH contacts marked]
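A minimal sketch of this energy function under the same coordinate encoding as above, assuming (as is standard for the HP model) that chain neighbours do not count as contacts; this is an illustration, not the talk's implementation:

```python
def hp_energy(sequence, positions):
    """Energy of an HP conformation: -1 per HH contact, where a contact
    is a pair of residues that are adjacent on the lattice but not
    adjacent in the chain."""
    energy = 0
    n = len(sequence)
    for i in range(n):
        for j in range(i + 2, n):          # skip chain neighbours
            if sequence[i] == "H" and sequence[j] == "H":
                (xi, yi), (xj, yj) = positions[i], positions[j]
                if abs(xi - xj) + abs(yi - yj) == 1:
                    energy -= 1
    return energy

# Example: an H...H pair brought into contact by a lattice turn.
# hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)])  ->  -1
```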
Structure prediction
vs.
protein folding
• We want to understand the
folding process.
• For this, it is not sufficient to
just predict the structure.
Zipping: Structure grows
by adding local contacts
(Fiebig & Dill ‘93)
Hierarchical folding:
Zipping and Assembly
• Zipping is local structure growth.
• Folding also requires assembly
of independent local structures.
If folding is hierarchical...
...folding routes are trees
Evidence for
hierarchical folding
Proteins have
recursive domains.
Some fragments
fold faster than
the whole chain.
Some fragments fold
by themselves.
(Fig.: G. Rose,1979)
Implementing
hierarchical search
A parsing-based search strategy:
The CKY algorithm searches
all binary trees defined
by a context-free grammar.
Can we use the same search strategy?
A new folding algorithm
(Hockenmaier, Joshi, Dill ‘07)
1. Split the chain into fragments.
2. Enter their structures into the chart.
3. Combine small structures (like a jigsaw puzzle).
4. Keep only the lowest-energy structures.
5. The top cell contains the folded structure.
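The following is only a cartoon of the chart organisation in steps 1-5, not the published algorithm: all of the geometric work of joining two lattice fragments is hidden behind an assumed `combine` callback, and the beam width is an arbitrary choice.

```python
import heapq

def fold_chart(sequence, combine, beam=10):
    """Cartoon of chart-based folding: chart[i][j] keeps a few low-energy
    (energy, structure) candidates for the fragment sequence[i:j].
    `combine(left, right)` is assumed to return the candidates obtained
    by joining two adjacent fragments (lattice geometry hidden inside)."""
    n = len(sequence)
    chart = [[[] for _ in range(n + 1)] for _ in range(n + 1)]

    # Steps 1-2: single residues are the initial fragments.
    for i in range(n):
        chart[i][i + 1] = [(0, sequence[i])]

    # Steps 3-4: combine adjacent fragments, keep only the best structures.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            candidates = []
            for k in range(i + 1, j):                      # split point
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        candidates.extend(combine(left, right))
            chart[i][j] = heapq.nsmallest(beam, candidates,
                                          key=lambda c: c[0])

    # Step 5: the top cell spans the whole chain.
    return chart[0][n]
```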
Extracting folding routes
Charts as energy landscapes
[Figure: the folding chart for an HP sequence, with each cell colored by the energy of its best structure, from 0 down to -4]
The chart landscape
determines
the amount of search
[Figure: example chart landscapes for fast, medium, and slow folding sequences]
Folding rates and
native state topology
[Figure: folding rate vs. contact order — real proteins (Plaxco et al. ’98): folding rate (log k) vs. relative contact order (%); HP proteins with our algorithm: log(k) vs. native contact order]
How well does this work?
• CKY is not guaranteed to find the native state, but:
- We tested 24,900 HP sequences of length 20 with a unique native state.
- Each sequence has 42,000,000 possible states.
- CKY finds the native state for 96.7% of them.
• Folding speed is correlated with contact order.
But...
... Proteins don’t use dynamic programming.
They misfold, unfold, refold....
... Can we predict the collective behavior of an ensemble of protein molecules?
Modeling the folding process
• We assume folding is hierarchical.
• We model the process as a Markov chain.
• We use CKY to construct this chain for each protein.
Folding as a Markov chain
[Diagram: states q_i and q_j connected by a transition with rate r_ij]
• The protein can be in (and move between) a finite set of states.
• A Markov chain defines the probability that the protein is in each state at time t.
An example of hierarchical folding:
[Diagram: a chain with possible contacts (1,4), (5,8), and (1,8)]
Allowed: {} → {(1,4)} or {(5,8)} → {(1,4),(5,8)} → {(1,4),(1,8),(5,8)}
This is not allowed: reaching {(1,8)} or {(1,8),(5,8)}, i.e. forming the non-local contact (1,8) before the local contacts (1,4) and (5,8) are in place.
Trees define states and folding steps
Folding: from siblings to the parent.
Unfolding: from the parent to children.
Calculating folding rates
Folding rates depend on the
difference in energy between
children and the parent node.
We can estimate this difference.
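A minimal sketch of one conventional way to turn such an energy difference into forward and backward rates; the talk does not give the exact formula, so the Metropolis-style rates, the temperature, and the attempt rate below are assumptions:

```python
import math

def folding_rates(delta_E, kT=1.0, attempt_rate=1.0):
    """Rates for a single folding step with energy change delta_E
    (E_parent - E_children, negative when folding is downhill).
    Metropolis-style: downhill moves happen at the attempt rate,
    uphill moves are suppressed by a Boltzmann factor."""
    fold = attempt_rate * min(1.0, math.exp(-delta_E / kT))
    unfold = attempt_rate * min(1.0, math.exp(delta_E / kT))
    return fold, unfold

# Example: forming one HH contact (delta_E = -1) at kT = 1:
# folding_rates(-1.0)  ->  (1.0, 0.3678...)
```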
Our folding algorithm
• Use “CKY” to find native state.
• Construct the Markov chain from
the parse chart.
• Let the protein fold!
(calculate the probability of
where it is at time 0,10,100...)
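A minimal sketch of this last step, assuming the chart construction has already produced a matrix R of transition rates between states (R[i][j] is the rate from state i to state j); the master-equation propagation below is standard, but the two-state example rates are made up for illustration.

```python
import numpy as np
from scipy.linalg import expm

def state_probabilities(R, p0, times):
    """Propagate a continuous-time Markov chain: given off-diagonal
    rates R[i, j] (state i -> state j) and an initial distribution p0,
    return the distribution p(t) = p0 @ exp(Q t) at each requested time."""
    Q = np.array(R, dtype=float)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows of the generator sum to 0
    return [p0 @ expm(Q * t) for t in times]

# Example: a two-state chain (unfolded <-> folded), starting unfolded.
# R = [[0.0, 1.0],
#      [0.37, 0.0]]
# state_probabilities(R, np.array([1.0, 0.0]), [0, 1, 10, 100])
```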
Our test sequence
• 16mer: helix and hairpin
• The Markov chain has 193 states
How the protein folds:
[Figure: probability of occupying states at each energy level, plotted against time (log scale)]
Evidence against
hierarchical folding
For some proteins,
the ends come
together early
during folding
(Maity et al. ’05)
What experiments would see
[Figure: the native structure and a trap structure of the 16mer (residue positions labeled), and the probability of observing each as a function of time (log scale)]
As a line plot:
[Figure: probability vs. time for 'Helix', 'Hairpin', 'End-to-End', '(3,6)', '(2,5)', 'Trap', and 'Native']
Experimental observations that “the ends come together early” are not evidence against hierarchical folding: macroscopic observations don’t always correspond to microscopic behavior.
Challenges
• Can these algorithms be applied to
realistic representations of
proteins?
• Can we define coarse-grained
representations of real proteins
(and energy functions) that don’t
require supercomputers?
To conclude....
Search
Algorithm
Structural
Representation
Scoring
Function
For a computer scientist,
protein folding and parsing
pose similar research questions
and require similar techniques.
Thank you!
http://www.cs.uiuc.edu/~juliahmr
My collaborators/mentors:
Ken A. Dill, UC San Francisco
Aravind Joshi, U. of Pennsylvania
Our funder:
National Science Foundation