~400,000 peptide mass spectra

advertisement
http://bioinformatics.icmb.utexas.edu/OPD
A few diverse examples of proteins:
A muscle protein:
aspirin
A virus protein shell (“capsid”):
Watercolors by David Goodsell, Scripps
Outline
Part I
What dictates the 3D shape (“fold”) of proteins?
1. Primary structure of proteins
- amino acids & peptide bonds
2. Secondary structure of proteins
- “local” folding topology & predicting 2° structure
3. Tertiary structure of proteins
- “global” folding topology
- X-ray crystallography & NMR
- aligning structure computationally
- protein folding
- designing new structures
Part II
How do proteins interact with each other in the cell?
The levels of protein structure:
Different representations of a typical globular protein (myoglobin):
- “ribbon” = Cα backbone
- ribbon + stick-figure side chains
- all atoms drawn at van der Waals radii
- solvent accessible surface
Due to resonance forms of the peptide bond:
Peptide bonds (N-CO) are planar, so the only allowed rotation
along the amino acid backbone is around the Cα-N and Cα-CO bonds
==> by convention, these angles are called φ & ψ
Protein folding = the selection of φ/ψ angles & side chain angles
leading to low energy packing of the atoms
A Ramachandran plot shows that only certain φ/ψ combinations are sampled,
dictated by steric hindrance of atoms neighboring the peptide bond
Favored regions correspond to secondary structures
==> allowable “local” structural conformations
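A backbone dihedral angle can be computed directly from four atom positions. The sketch below is a minimal pure-Python version; the function name `dihedral` and the toy coordinates are illustrative assumptions, not part of the slides. For residue i, φ would use the atoms C(i-1), N(i), Cα(i), C(i), and ψ would use N(i), Cα(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Project the two outer bonds onto the plane perpendicular to the central bond
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

# Toy geometry (hypothetical points, not real protein coordinates)
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis, trans)
```

Computing φ and ψ this way for every residue of a structure gives the points of a Ramachandran plot.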
3 of the most common secondary structures
α helix: 3.6 aa’s/turn
http://www.rtc.riken.go.jp/jouhou/image/protein/2ndst/2ndst.html
Amino acids vary in their intrinsic propensities to adopt
the different secondary structures
Given aa sequence, how to predict 2° structure? ==> PHD
input = 13 aa sliding window
- neural network, predicts 3 states: α helix, β strand, coil
& relative level of solvent accessibility
==> 3 state prediction accuracy ~72%
http://maple.bioc.columbia.edu/predictprotein/
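The 13-residue sliding window and the 3-state accuracy measure can be sketched as follows. This is only an illustration of the windowing idea: the padding character, the function names, and the one-letter state labels (H, E, C) are assumptions, and the neural network itself (and whatever input encoding the real predictor uses) is not shown.

```python
def sliding_windows(seq, width=13, pad="-"):
    """Yield (central_residue, window) for every position in seq.

    The ends are padded (an assumption) so each residue sits at the
    center of a full-width window.
    """
    half = width // 2
    padded = pad * half + seq + pad * half
    for i, aa in enumerate(seq):
        yield aa, padded[i:i + width]

def q3_accuracy(predicted, observed):
    """Fraction of residues whose 3-state label is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

for residue, window in sliding_windows("MEALKVG"):
    print(residue, window)
```

A predictor would map each window to one state for its central residue; `q3_accuracy` then compares the predicted state string with the observed one.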
Some proteins have unusual secondary structures that
span the membrane => membrane proteins
How to identify transmembrane segments in a protein?
Current best approach, TMHMM,
is based on hidden Markov models (HMMs).
A generic HMM:
Hidden states: X, Y
- transition probabilities govern moves between the hidden states
- each hidden state has its own emission probabilities for the observable symbols A, B, C, ...
  (0.4, 0.1, 0.3, ... for one state; 0.1, 0.3, 0.4, ... for the other)
Hidden state seq: XXXXYYYYXXXY
Observable seq: CCBCCAAABCAC
Goal = recover hidden state sequence by
analyzing emissions
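Recovering the most probable hidden state sequence from the emissions is commonly done with the Viterbi algorithm. The sketch below is a minimal two-state version. All start, transition, and emission probabilities are made-up illustrative values (assumptions), not the numbers from the slide and not TMHMM's parameters.

```python
import math

# Hypothetical model parameters (assumptions for illustration only)
STATES = ("X", "Y")
START_P = {"X": 0.5, "Y": 0.5}
TRANS_P = {"X": {"X": 0.9, "Y": 0.1}, "Y": {"X": 0.1, "Y": 0.9}}
EMIT_P = {"X": {"A": 0.1, "B": 0.3, "C": 0.6},
          "Y": {"A": 0.6, "B": 0.3, "C": 0.1}}

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for a non-empty observed sequence."""
    # best[t][s] = log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))

print(viterbi("CCBCCAAABCAC", STATES, START_P, TRANS_P, EMIT_P))
```

Working in log space avoids numerical underflow on long sequences, which matters for protein-length inputs.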
TMHMM hidden Markov model:
- inside & outside loop models
- helix cap models
- HMM for 5-25 aa helix core
Correctly predicts >90% of the transmembrane helices
Discriminates between soluble and membrane proteins with
false positive rate ~1%
Krogh et al, J Mol Biol. 305:567-80 (2001)
Packing of secondary structures leads to more complex
3D assemblies (“motifs”):
Tertiary structure
= 3D packing of secondary structural elements
- Hydrophobic residues (Phe, Ile, Leu, Trp) buried in the core
- Core densely packed; no room even for H2O; packing density comparable to a typical crystal
- Core atoms so close that van der Waals interactions contribute significantly
- Charged and polar R groups (e.g., Arg, Lys, Glu, Asp, His) on outside and hydrated
Experimental approaches to protein structure I
X-ray crystallography
1. Crystal of pure protein
2. Rotate crystal, collect amplitudes of diffracted X-rays as function of incident angle of X-rays
3. Find phases of diffracted X-rays (by experiment or computation)
4. With phases & amplitudes, Fourier transform to find distribution of electrons (“electron density”) in protein
5. Build atomic model into electron density, refine
Electrons in crystal diffract X-rays according to Bragg’s Law: nλ = 2d sin θ
λ = wavelength
d = distance between atomic layers in crystal
θ = angle of X-rays to plane of atoms
From B. Rupp’s X-ray crystallography intro:
http://www-structure.llnl.gov/Xray/101index.html
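Bragg’s law, nλ = 2d sin θ, can be evaluated numerically in either direction. The sketch below is a small illustration; the function names are assumptions, and the wavelength used in the example (1.5418 Å, the Cu Kα line common in laboratory X-ray sources) is just a sample value.

```python
import math

def bragg_angle_deg(d, wavelength, n=1):
    """Diffraction angle theta (degrees) for layer spacing d; d and wavelength in the same units."""
    s = n * wavelength / (2.0 * d)
    if s > 1.0:
        raise ValueError("no diffraction: n * wavelength exceeds 2d")
    return math.degrees(math.asin(s))

def bragg_spacing(theta_deg, wavelength, n=1):
    """Layer spacing d that diffracts at angle theta (degrees)."""
    return n * wavelength / (2.0 * math.sin(math.radians(theta_deg)))

CU_K_ALPHA = 1.5418  # Angstrom; example wavelength only
theta = bragg_angle_deg(3.0, CU_K_ALPHA)
print(theta, bragg_spacing(theta, CU_K_ALPHA))
```

Smaller spacings d diffract at larger angles, which is why high-resolution data appear far from the beam center.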
Experimental approaches to protein structure II
Nuclear magnetic resonance
Setup: protein in solution, in center of a very strong magnet, with coils to send/detect radio waves
1. Vary radio wave pulses, measure field generated in response over time
=> function of chemical environment of each nucleus
2. Assign identities to nuclei, measure distances between amino acid atoms
3. Use distance geometry to solve for ensemble of 3D structures consistent with distance constraints
Basic principle: Atomic nuclei w/ odd mass #’s have spin
==> charged, spinning particles produce a magnetic field
In an external magnetic field, this nuclear magnetic field precesses around an axis
Can observe this process by applying radio wave pulses at
frequencies related to precession frequencies & measuring
the resulting induced electric current
Flemming Poulsen, A Brief Introduction to NMR Spectroscopy of Proteins.
3 broadest classes of protein 3D structures
Fibrous
e.g., collagen
Membrane
e.g., K+ channel
& Globular ...
Examples of globular protein “folds”
all α
α/β
all β
α+β
>24,000 experimentally determined protein structures
stored in PDB database: http://www.rcsb.org/pdb/
Atomic coordinates of a protein structure (PDB format)
- first 3 aa’s = Met-Glu-Ala...
Columns: record, atom # & name, aa type, chain, aa #, atomic coordinates (x, y, z), occupancy, B-factor, atom type
record  atom #  name  aa type  chain  aa #        x         y        z  occupancy  B-factor  atom type
ATOM         1  N     MET      A         1   32.632   -11.712   53.840       1.00     63.20  N
ATOM         2  CA    MET      A         1   31.203   -12.125   53.853       1.00     63.20  C
ATOM         3  C     MET      A         1   30.947   -12.743   55.207       1.00     63.20  C
ATOM         4  O     MET      A         1   31.741   -13.533   55.685       1.00     63.20  O
ATOM         5  CB    MET      A         1   30.931   -13.144   52.733       1.00     96.70  C
ATOM         6  CG    MET      A         1   29.500   -13.132   52.189       1.00     96.70  C
ATOM         7  SD    MET      A         1   28.784   -14.774   52.145       1.00     96.70  S
ATOM         8  CE    MET      A         1   27.934   -14.832   53.770       1.00     96.70  C
ATOM         9  N     GLU      A         2   29.841   -12.367   55.822       1.00     61.59  N
ATOM        10  CA    GLU      A         2   29.498   -12.881   57.128       1.00     61.59  C
ATOM        11  C     GLU      A         2   28.134   -12.349   57.527       1.00     61.59  C
ATOM        12  O     GLU      A         2   28.043   -11.213   57.995       1.00     61.59  O
ATOM        13  CB    GLU      A         2   30.533   -12.408   58.152       1.00     51.85  C
ATOM        14  CG    GLU      A         2   30.050   -12.440   59.600       1.00     51.85  C
ATOM        15  CD    GLU      A         2   30.843   -11.520   60.513       1.00     51.85  C
ATOM        16  OE1   GLU      A         2   31.432   -10.532   60.018       1.00     51.85  O
ATOM        17  OE2   GLU      A         2   30.858   -11.780   61.737       1.00     51.85  O
ATOM        18  N     ALA      A         3   27.077   -13.140   57.353       1.00     71.14  N
ATOM        19  CA    ALA      A         3   25.751   -12.666   57.749       1.00     71.14  C
ATOM        20  C     ALA      A         3   25.735   -12.594   59.298       1.00     71.14  C
ATOM        21  O     ALA      A         3   25.475   -13.591   59.986       1.00     71.14  O
ATOM        22  CB    ALA      A         3   24.678   -13.608   57.214       1.00     34.69  C
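PDB ATOM records are fixed-width text, so they can be read by slicing character columns. The sketch below parses the first three atoms of the Met-Glu-Ala example, written out in standard PDB column layout, and computes the N-Cα bond length as a sanity check. The `Atom` type and `parse_atom_line` function are illustrative names (assumptions), and the parser ignores fields such as alternate locations and charges.

```python
import math
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str

def parse_atom_line(line):
    """Parse one fixed-width PDB ATOM record (0-based slices of the 1-based PDB columns)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )

SAMPLE = [
    "ATOM      1  N   MET A   1      32.632 -11.712  53.840  1.00 63.20           N",
    "ATOM      2  CA  MET A   1      31.203 -12.125  53.853  1.00 63.20           C",
    "ATOM      3  C   MET A   1      30.947 -12.743  55.207  1.00 63.20           C",
]

atoms = [parse_atom_line(line) for line in SAMPLE if line.startswith("ATOM")]
n, ca = atoms[0], atoms[1]
n_ca = math.dist((n.x, n.y, n.z), (ca.x, ca.y, ca.z))
print(f"N-CA bond length: {n_ca:.3f} Angstrom")
```

Slicing by column is safer than splitting on whitespace, because adjacent fields in real PDB files can touch when values are wide.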
Some of the major computational questions in structural biology
1. How to distinguish membrane proteins from soluble proteins?
2. How to align protein structures & start organizing them into families, etc.?
3. How to predict folded protein structure from the linear amino acid sequence?
4. How to identify the active/functional region of the protein from the structure?
5. How to predict the interactions of drugs or other proteins from the structure?
6. How to computationally predict the structural consequences of mutations?
7. How to predict protein function from structure?
8. How to design new or unnatural protein structures?
How to find the best superposition of 2 protein structures?
Note: superimposing 2 structures is easy if you know the equivalent amino acids
-> the hard part is to find this mapping of atoms from 1 structure to the other
One now-classic approach: DALI
1. Take protein #1 structure, Cα coordinates only
2. Calculate matrix of all pairwise Cα-Cα distances (amino acid # vs. amino acid #)
3. Repeat for protein #2
4. Align sequence #1 to sequence #2 so as to maximize similarity in contact patterns
Holm & Sander, J Mol Biol. 233:123-38 (1993)
Best structural alignment corresponds to maximizing
S = \sum_{i=1}^{L} \sum_{j=1}^{L} \phi(i, j)
i, j = aligned pairs of matched residues: i = (iA, iB), j = (jA, jB)
\phi = similarity of 2 Cα-Cα distance matrices, d^A_{ij} and d^B_{ij}
In the simplest case,
\phi(i, j) = \theta^R - |d^A_{ij} - d^B_{ij}|
where d^A_{ij} and d^B_{ij} are the Cα-Cα distances between equivalenced residues in proteins A and B,
and \theta^R = minimum level of similarity
Choose mapping of residues (e.g., iA to iB) to minimize |d^A_{ij} - d^B_{ij}|
[Figure: residues iA and jA in protein A, separated by distance d^A_{ij}, matched to residues iB and jB in protein B, separated by distance d^B_{ij}]
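The distance-matrix comparison can be sketched in a few lines. This illustrates only the scoring of a given residue mapping with the simplest similarity term (a threshold minus the absolute difference of the two Cα-Cα distances); it does not search for the best mapping, which is the hard part. The function names, the default threshold of 1.5 Å, and the toy coordinates are assumptions for illustration.

```python
import math

def distance_matrix(coords):
    """All pairwise distances between C-alpha coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def alignment_score(d_a, d_b, pairs, theta=1.5):
    """Sum of theta - |dA[iA][jA] - dB[iB][jB]| over all pairs of matched residues.

    pairs is a list of (index_in_A, index_in_B) equivalences; theta is an
    assumed similarity threshold in Angstrom.
    """
    total = 0.0
    for i_a, i_b in pairs:
        for j_a, j_b in pairs:
            total += theta - abs(d_a[i_a][j_a] - d_b[i_b][j_b])
    return total

# Toy "proteins": B is A translated, so the correct mapping is the identity
coords_a = [(0, 0, 0), (4, 0, 0), (4, 3, 0)]
coords_b = [(1, 1, 1), (5, 1, 1), (5, 4, 1)]
d_a, d_b = distance_matrix(coords_a), distance_matrix(coords_b)

good = alignment_score(d_a, d_b, [(0, 0), (1, 1), (2, 2)])
bad = alignment_score(d_a, d_b, [(0, 1), (1, 0), (2, 2)])
print(good, bad)
```

Because distance matrices are unchanged by rotation and translation, the score can be computed without superimposing the structures first.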
The ability to compare structures has led to recognition
of a hierarchy of 3° structures (“folds”)
As organized in the CATH or SCOP or FSSP databases:
Class
Architecture
Topology
Homologous Superfamily (H), e.g., flavodoxin homologues
Manual classification at architecture level,
automated at topology level
Protein Folding
Classic experiment from 1960’s (Chris Anfinsen):
Purified small protein RNase A,
refolded in a few minutes in solution
==> all information necessary for correct folding
was captured in the linear amino acid sequence
Corollary:
Proteins do not fold by randomly testing conformations.
Given a 100 amino acid protein,
& 10 possible conformations / amino acid
= 10^100 possible conformations for the protein
==> not possible to randomly sample; clearly a constrained search
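The scale of the random-search argument is easy to check with arithmetic. In the sketch below, the sampling rate of 10^13 conformations per second is an assumed, deliberately generous figure (it is not from the slides), and the function name is illustrative.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365

def random_search_years(n_residues, confs_per_residue, samples_per_second):
    """Years needed to visit every conformation once at a fixed sampling rate."""
    conformations = confs_per_residue ** n_residues
    return conformations / samples_per_second / SECONDS_PER_YEAR

# 100 residues, 10 conformations each, assumed 1e13 samples per second
years = random_search_years(100, 10, 1e13)
print(f"{years:.2e} years")
```

Even at that rate the exhaustive search would take on the order of 10^79 years, compared with the few minutes observed for refolding.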
An energetic view of the folding process
[Figure: free energy vs. folding trajectory, with states U, M, T, F]
U = Unfolded: large # of different conformations
Fast: local secondary structures form first, “hydrophobic collapse”
M = Molten globule: collection of conformationally similar, interconverting molecules
Slow: optimize packing
T = Transition state
F = Folded: unique or small # of final conformations
Adapted from Branden & Tooze
One long-time goal of biologists/biophysicists:
Solve the Protein Folding Problem =
computationally predict protein 3D fold
from 1D amino acid sequence
Two general approaches:
1st principles/ab initio:
e.g., atomistic molecular dynamics simulations
of proteins, modeling force fields w/ electrostatic,
van der Waals forces, solvent, etc. over long time
Empirical:
- fold recognition/threading
- reverses the process: given set of structures,
learn empirical rules that predict folds
Empirical currently more successful at predicting final
structure, but no information about folding trajectory
An example of a successful design of a new protein fold
by a combination of empirical & ab initio structural modeling
designed 93 amino acid
protein with topology
not in PDB dbase
designed model
solved structure
Kuhlman et al, Science, 302:1364-1368 (2003)
The Kuhlman et al. design strategy
Starting model = Choose predefined 3D topology
Assemble 3D model from 3 and 9 amino acid fragments of known structure
==> Generated 172 backbone-only starting models
Initialization
Choose optimal sequence for each starting model using energy function that captures:
12-6 Lennard-Jones potential
orientation-dependent hydrogen bonding term
implicit solvation model
Choose amino acid side chain orientations (“rotamers”) by sampling from known structures
Iterate between:
Optimize choice of amino acid sequence for a fixed backbone conformation
Optimize amino acid backbone coordinates for a fixed sequence
Same energy function used at all stages
Only previous lowest energy sequence/structure optimized at each stage
Final designed sequence not similar to any known protein sequence
Kuhlman et al, Science, 302:1364-1368 (2003)
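One term named in the energy function above, the 12-6 Lennard-Jones potential, has a simple closed form. The sketch below uses the textbook expression 4ε[(σ/r)^12 - (σ/r)^6] in reduced units; the parameter values are placeholders (assumptions), not the ones used by Kuhlman et al., and the hydrogen bonding and solvation terms are not shown.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones energy for two atoms separated by distance r.

    epsilon = well depth, sigma = distance at which the energy is zero
    (placeholder values in reduced units).
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

r_min = 2 ** (1 / 6)  # separation of minimum energy, in units of sigma
for r in (0.9, 1.0, r_min, 1.5, 2.5):
    print(f"r = {r:.3f}  E = {lennard_jones(r):.4f}")
```

The steep repulsive wall at short distance penalizes atomic overlap, while the shallow attractive well rewards the tight packing seen in protein cores.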
References
A good introduction to structural biology =
Introduction to Protein Structure
- Carl Branden & John Tooze
Web resources:
Protein Data Bank = >24,000 protein structures, atomic
coordinates, & the “protein of the month”
http://www.rcsb.org/pdb
CATH/SCOP protein structure hierarchies:
http://www.biochem.ucl.ac.uk/bsm/cath/
http://scop.mrc-lmb.cam.ac.uk/scop/
Several of the illustrations in this tutorial were taken from
Lehninger Principles of Biochemistry, by Nelson & Cox
Part II
Macrophage (“white blood cell”)
“Macrophage and Bacterium 2,000,000X”
Watercolor by David S. Goodsell, 2002
Blood serum
Bacterium
Typical size ranges of known protein structures & assemblies
single
protein
domain
dimeric
protein
aquaporin
(membrane channel)
Ribosome
From a (recommended) review article ==> Sali et al. Nature 422:216-225 (2003)
Outline
Part I
What dictates the 3D shape (“fold”) of proteins?
Part II
How do proteins interact with each other in the cell?
4. “Quaternary” structure of proteins
& protein interactions
5. Experimental approaches to determine interactions
- yeast 2 hybrid, mass spectrometry
6. Testing the accuracy of the interactions
7. Moving back to the atomic resolution world
- electron microscopy & tomography
- modeling structures of complexes
Why study interactions?
Proteins interact all the time (e.g., bump into each other non-specifically)
We’re interested in specific interactions
==> e.g., those w/ downstream consequences
For example, consequences might include:
Inducing a change in the structure of an interaction partner
Stabilizing or destabilizing an interaction partner
Modifying the activity of a protein (activate, inhibit, or otherwise regulate)
Causing an interaction partner to move to another location
Cutting (cleaving) an interaction partner
Chemically modifying an interaction partner (phosphorylate, dephosphorylate,
glycosylate, deglycosylate, ubiquitinate, sumoylate, etc.)
==> more than 200 modifications to proteins known, many catalyzed
by other proteins
So, defining interactions helps to define these processes &
their functional consequences
Experimental/Computational methods for observing/inferring protein interactions
Sali et al. Nature 422:216-225 (2003)
[Figure: ATP synthase shown as X-ray structure, schematic version, and network representation; subunits α, β, δ, γ, b2, ε, a, c12]
Total set = protein complex
Sum of direct + indirect interactions
Some methods measure direct interactions, some indirect
Xenarios & Eisenberg, Curr. Op. Biotech. 12:334-9 (2001)
Interactions between yeast proteins
Experimental approaches to protein interactions I
Yeast two-hybrid
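- “Bait” protein fused to a DNA binding domain (DBD)
- “Prey” protein fused to a transcription activation domain (Act)
- DBD binds the operator or upstream activating sequence
- Bait + prey interaction brings Act to the core transcription machinery ==> transcription of the reporter gene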
Basic idea = screen library of “prey” proteins to test which ones
interact with a given “bait” protein
Fields & Song, Nature 340:245-6 (1989)
Experimental approaches to protein interactions I
High-throughput yeast two-hybrid I
Haploid yeast cells expressing activation domain-prey fusion proteins
Diploid yeast probed with DNA-binding domain-Pcf11 bait fusion protein
Uetz et al. Nature 403 (2000)
A second group (Ito et al.), with a related yeast two-hybrid approach, also mapped
a large number of interactions, then compared the interactions w/ the Uetz data:
A surprise at the time was the apparent inconsistency among the interaction sets
==> either # of potential interactions is large or false positive rate high (or both)
Ito et al. PNAS 98:4569-74 (2001)
Experimental approaches to protein interactions II
Mapping complexes by mass spectrometry I
1. Tag “bait” protein
2. Interaction partners co-purified with “bait” on affinity column
3. Separate co-purified proteins (protein 1, protein 2, ... protein 6) by SDS-PAGE
4. Trypsin digest, identify peptides by mass spectrometry
493 bait proteins ==> 3617 “interactions”
Ho et al. Nature 415 (2002)
Experimental approaches to protein interactions II
A variant: Tandem affinity purification (TAP) + Mass spectrometry
1. Bait protein carries two tags (Tag1, Tag2)
2. Affinity column 1, + protease
3. Affinity column 2
4. SDS-PAGE of co-purified proteins (protein 1, protein 2, ... protein 6)
5. Trypsin digest, identify peptides by mass spectrometry
Rigaut et al., Nature Biotech. 17:1030-2 (1999)
Gavin et al. Nature 415 (2002)
How accurate are these high-throughput screens?
Can compare to known interactions, but these are incomplete
A different strategy is to identify properties that correlate with interactions
& test versus those properties
Three tests:
1. Comparison of interactions to a reference interaction set
2. Comparison of mRNA co-expression of interacting partners
3. Comparison of functions of predicted interaction partners
Test #1
Estimate accuracy by comparing to a well-determined
reference set of interactions
(tends to underestimate accuracy)
von Mering, Krause et al. Nature May 8, 2002
Test #2
Estimating interaction assay accuracy by
assessing mRNA co-expression of putative interaction partners
[Figure: distributions of the correlation coefficient between expression vectors (derived from many DNA microarray experiments) for random protein pairs vs. true interactions]
Estimate % false positives from observed vs. expected genes w/ correlated expression
Estimated false positive rates based on this test:
Mrowka et al. Genome Research 11:1971-3 (2001)
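The co-expression measure in this test is a correlation coefficient between two genes’ expression vectors. The sketch below computes a Pearson correlation in pure Python; the expression values and gene names are made up for illustration (assumptions), and the step that turns the resulting distributions into a false positive estimate is not shown.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length expression vectors."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Hypothetical expression levels across five microarray experiments
gene_a = [0.1, 0.8, 1.5, 0.9, 0.2]
gene_b = [0.2, 0.9, 1.4, 1.0, 0.1]
gene_c = [1.2, 0.3, 0.1, 0.4, 1.1]
print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```

Pairs from a reliable interaction set should be shifted toward high correlation relative to random protein pairs.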
A related strategy: fit distribution of co-expression relationships as mix
of those from random & well-characterized interactions
==> Mixture % indicates accuracy.
Deane, Salwinski et al. Mol. Cell. Proteomics (2002)
Estimated true positive rates based on this test
[Figure: estimated true positive rate vs. increasing # of Interaction Sequence Tags for genome-wide yeast two-hybrid data; other categories shown: at least 1 small-scale expmt, >1 independent expmt, >2 independent expmt, paralogs also interact]
Deane, Salwinski et al. Mol. Cell. Proteomics (2002)
Test #3
Estimate accuracy by measuring functional similarity of putative partners
==> in particular, measure tendency to be in same cellular system or process
From literature & pathway databases (KEGG/GO), we know ~1,000-3,000 yeast protein functions:
Protein A = Swi4: Cell cycle, MAPK signaling pathway
Protein B = Cdc27: Cell cycle, Ubiquitin-mediated proteolysis
pw1 = pathways of A, pw2 = pathways of B
Jaccard coefficient = # pathways in common / # total pathways = |pw1 \cap pw2| / |pw1 \cup pw2|
\langle pathway similarity \rangle = \frac{1}{n} \sum_{pairs}^{n} \frac{|pw1 \cap pw2|}{|pw1 \cup pw2|}
Systematically test every pair of characterized proteins
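The pathway overlap score is a Jaccard coefficient averaged over protein pairs. The sketch below uses the Swi4/Cdc27 annotations from the slide; the function names and the dictionary layout are illustrative assumptions.

```python
def jaccard(pw1, pw2):
    """# pathways in common / # total pathways for two annotation sets."""
    pw1, pw2 = set(pw1), set(pw2)
    union = pw1 | pw2
    return len(pw1 & pw2) / len(union) if union else 0.0

def mean_pathway_similarity(pairs, pathways):
    """Average Jaccard coefficient over a list of (protein A, protein B) pairs."""
    scores = [jaccard(pathways[a], pathways[b]) for a, b in pairs]
    return sum(scores) / len(scores)

pathways = {
    "Swi4": {"Cell cycle", "MAPK signaling pathway"},
    "Cdc27": {"Cell cycle", "Ubiquitin-mediated proteolysis"},
}
print(jaccard(pathways["Swi4"], pathways["Cdc27"]))
print(mean_pathway_similarity([("Swi4", "Cdc27")], pathways))
```

Swi4 and Cdc27 share one pathway out of three in total, so their coefficient is 1/3.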
Quality of the observed protein-protein interactions
as measured by the pathway overlap test
[Figure: agreement of interacting proteins’ pathways for small-scale experiments (max agreement) vs. large-scale yeast two-hybrid interaction experiments]
Date & Marcotte, Nature Biotech. 2003
The various accuracy tests agree to a first approximation
(at least as regards the ranking of accuracies)
Estimated true positive rate via co-expression (Mrowka; Deane), test set (von Mering), and pathways (Date).
Estimates are listed in slide order; rows with fewer than four values do not show which columns are empty.
Authors                       Method            # interactions   Estimates
Ito et al.                    Y2H               4081             9%, 22%, 6-10%, ~18%
Ho et al.                     MS                ~3617            ~10%, 1-3%
Gavin et al.                  MS                ~1440            ~85%, ~10%
Uetz et al.                   Y2H               957              53%, 50%, ~57%
Tong et al.                   synthetic lethal  295              ~20%
>1 independent experiment                       ~2000            85%, ~30-40%, ~87%
>2 independent experiments                      1167             88%, ~60-70%, ~95%
The current highest throughput protein interaction screens:
Organism     Authors                 Method   # interactions
Yeast        Ito et al.              Y2H      4081
Yeast        Tong et al.             SL       ~4000
Yeast        Ho et al.               MS       ~3617
Yeast        Gavin et al.            MS       ~1440
Yeast        Uetz et al.             Y2H      957
Yeast        Fromont-Racine et al.   Y2H      357
Yeast        Tong et al.             SL       295
Yeast        Newman et al.           Y2H      152
C. elegans   Li et al.               Y2H      ~4000
C. elegans   Walhout et al.          Y2H      148
C. elegans   Davy et al.             Y2H      138
Fly          Giot et al.             Y2H      20,405
Human        Bouwmeester et al.      MS       221
& several others, including Hepatitis C & H. pylori
Y2H = yeast two hybrid
MS = mass spectrometry
SL = synthetic lethal
How many meaningful physical protein-protein
interactions are there?
At a rough estimate:
Yeast
~5,800 genes
~5,800 proteins
x 2-10 interactions/protein
~12,000 - 60,000 interactions
>10-20,000 known,
perhaps ~1/2 correct
==> ~1/3 of the way to a complete map!
Human
~40,000 genes
>>40,000 proteins
x 2-10 interactions/protein
>>80,000 - 400,000 interactions
<5,000 known
==> approx. 1% of
the complete map!
==> We’re a long way from the complete map of the human “interactome”
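The rough estimate above is a simple multiplication, sketched below. The function name is illustrative, and reading the human "approx. 1%" figure as known interactions divided by the upper end of the range is an assumption about how the slide's number was reached.

```python
def interaction_range(n_proteins, per_protein=(2, 10)):
    """Low and high estimates of total interactions, as on the slide (proteins x interactions/protein)."""
    low, high = per_protein
    return n_proteins * low, n_proteins * high

yeast = interaction_range(5_800)    # slide rounds this to ~12,000 - 60,000
human = interaction_range(40_000)   # slide treats this as a lower bound

# Assumed reading: at most 5,000 known human interactions vs. the high estimate
human_coverage = 5_000 / human[1]
print(yeast, human, f"{human_coverage:.2%}")
```

The estimate is only as good as the assumed 2-10 interactions per protein, so the totals should be read as orders of magnitude.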
Can we relate these interactions back to the protein structure?
==> A growing area of research is combination of low resolution structure
with atomic models to build structures of protein complexes:
For example:
Experimental or computational protein models
+ low resolution electron density map from electron microscopy or electron tomography
==> rough estimate of atomic model of protein complex
Example 1 – Electron microscopy of a protein complex
Experimental electron microscopy data
Reconstructed electron density map
of protein complex
Dock atomic
models into
electron
density maps
Sali et al. Nature 422:216-225 (2003)
Example 2- Electron tomography of a protein complex/assembly
Measure projections
of molecules after
illuminating with
electron beam
from different angles
Reconstruct density
distribution (“tomogram”)
as sum of back-projected
densities
Sali et al. Nature 422:216-225 (2003)
Reconstructing cellular organization of molecular complexes by
fitting structures into electron tomograms
“noisy” tomogram
(3D density map)
of single cell
Fit known structures
(“templates”) into density
Sali et al. Nature 422:216-225 (2003)
Some Protein Interaction Resources on the Internet
Protein interaction databases
Biomolecular Interaction Network Database (BIND)
http://www.blueprint.org/bind/bind.php
Currently 73,000 interactions
Database of Interacting Proteins (DIP)
http://dip.doe-mbi.ucla.edu
Currently 44,000 interactions
Protein Quaternary Structure database (PQS)
http://pqs.ebi.ac.uk
Atomic structures of interacting proteins
Interactive visualization of networks
Cytoscape:
http://www.cytoscape.org
Interactive display of protein networks
LGL (Large Graph Layout):
http://bioinformatics.icmb.utexas.edu/LGL
Visualization of networks with up to millions of edges, 100,000’s of vertices