Statistical analysis of DNA microarray data

Introduction to Proteomics and
Protein Structure Modeling
BMI 705
Kun Huang
Department of Biomedical Informatics
Ohio State University
Review of Protein Structure (5 min)
Introduction to Proteomics (10 min)
Protein Structure Database and Classification
(15 min)
Protein Structure Prediction (15 min)
3-D Alignment (left for next lab session)
Review of Biology – Protein Structure
Obtaining 3-D structure (Computation)
Review of Biology – Protein Structure
Levels of structure
Review of Biology – Protein Topology
Review of Biology – Protein Structure
Obtaining 3-D structure
Review of Biology – Protein Structure
Obtaining 3-D structure (NMR)
Review of Biology – Protein Structure
Obtaining 3-D structure (Bioinformatics)
Review of Biology – Protein Structure
3-D structure (dynamics / computation)
Subdomain Rearrangement in HIV-1 Reverse Transcriptase
Review of Biology – Protein Structure
3-D structure (modulation)
Binding with ligand
Methylation
Phosphorylation
Glycosylation, ubiquitination, etc.
Post-Translational Modification (PTM)
PTMs involving addition include:
acetylation, the addition of an acetyl group, usually at the N-terminus of the protein
alkylation, the addition of an alkyl group (e.g. methyl, ethyl)
methylation, the addition of a methyl group, usually at lysine or arginine
residues. (This is a type of alkylation.)
biotinylation, acylation of conserved lysine residues with a biotin appendage
glutamylation, covalent linkage of glutamic acid residues to tubulin and some other
proteins.
glycylation, covalent linkage of one to more than 40 glycine residues to the tubulin C-terminal tail
glycosylation, the addition of a glycosyl group to either asparagine, hydroxylysine, serine,
or threonine, resulting in a glycoprotein
isoprenylation, the addition of an isoprenoid group (e.g. farnesol and geranylgeraniol)
lipoylation, attachment of a lipoate functionality
phosphopantetheinylation, the addition of a 4'-phosphopantetheinyl moiety from
coenzyme A, as in fatty acid, polyketide, non-ribosomal peptide and leucine
biosynthesis
phosphorylation, the addition of a phosphate group, usually to serine, tyrosine, threonine
or histidine
sulfation, the addition of a sulfate group to a tyrosine.
Selenation
C-terminal amidation
Post-Translational Modification (PTM)
PTMs involving addition of other proteins or peptides
ISGylation, the covalent linkage to the ISG15 protein (Interferon-Stimulated Gene 15)
SUMOylation, the covalent linkage to the SUMO protein (Small
Ubiquitin-related MOdifier)
ubiquitination, the covalent linkage to the protein ubiquitin.
PTMs involving changing the chemical nature of amino acids
citrullination, or deimination, the conversion of arginine to citrulline
deamidation, the conversion of glutamine to glutamic acid or
asparagine to aspartic acid
Review of Protein Structure (5 min)
Introduction to Proteomics (10 min)
Protein Structure Database and Classification
(15 min)
Protein Structure Prediction (15 min)
3-D Alignment (10 min)
Proteomics
The term proteome was coined by Mark Wilkins in 1995 and is used to
describe the entire complement of proteins in a given biological
organism or system at a given time, i.e. the protein products of the
genome. The term has been applied to several different types of
biological systems. A cellular proteome is the collection of proteins
found in a particular cell type under a particular set of environmental
conditions such as exposure to hormone stimulation.
Proteomics vs. Genomics
The proteome is larger than the genome, especially in eukaryotes, in the
sense that there are more proteins than genes. This is due to alternative
splicing of genes and post-translational modifications
like glycosylation or phosphorylation.
The proteome has at least two levels of complexity lacking in the genome.
While the genome is defined by the sequence of nucleotides, the proteome
cannot be limited to the sum of the sequences of the proteins present.
Knowledge of the proteome requires knowledge of (1) the structure of the
proteins in the proteome and (2) the functional interaction between the
proteins.
Proteomics Techniques – 2D Gel
Proteomics, the study of the proteome, has largely been practiced through
the separation of proteins by two dimensional gel electrophoresis. In the
first dimension, the proteins are separated by isoelectric focusing, which
resolves proteins on the basis of charge. In the second dimension, proteins
are separated by molecular weight using SDS-PAGE. The gel is stained with
Coomassie Blue or silver to visualize the proteins. Spots on the gel are
proteins that have migrated to specific locations.
Matching is a big issue
Proteomics Techniques – Mass Spec
Peptide mass fingerprinting identifies a protein by cleaving it into short
peptides and then matching the observed peptide masses against the masses
predicted from a sequence database. Tandem mass
spectrometry, on the other hand, can get sequence information from
individual peptides by isolating them, colliding them with a nonreactive gas,
and then cataloging the fragment ions produced.
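The matching step of peptide mass fingerprinting can be sketched in a few lines. The example below is a minimal illustration, not a real search engine: it assumes trypsin as the protease (cleavage after K or R unless followed by P), uses standard monoisotopic residue masses, and the function names, the tiny database and the 0.01 Da tolerance are all hypothetical choices.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # each free peptide carries one extra water


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def count_matches(observed, sequence, tolerance=0.01):
    """Number of observed masses explained by the protein's tryptic peptides."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in observed)


def best_match(observed, database):
    """Name of the database entry that explains the most observed masses."""
    return max(database, key=lambda name: count_matches(observed, database[name]))


# Hypothetical two-entry database and spectrum.
database = {"protA": "AKRPGKC", "protB": "GGGKAAAK"}
observed = [peptide_mass("AK"), peptide_mass("RPGK")]
print(best_match(observed, database))
```

Real tools also handle missed cleavages, modifications and charge states, and score matches statistically rather than by a simple count.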
Proteomics Techniques – Microarray
A microarray measures mRNA levels. No change in mRNA does not necessarily mean no change
in protein expression and function, because of post-translational modulation.
Review of Protein Structure (5 min)
Introduction to Proteomics (10 min)
Protein Structure Database and Classification
(15 min)
Protein Structure Prediction (15 min)
3-D Alignment (left for next lab)
Protein Databases
UniProt is the universal protein database, a central repository of protein
data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the
world's most comprehensive resource on protein information.
The Protein Information Resource (PIR), located at Georgetown
University Medical Center (GUMC), is an integrated public bioinformatics
resource to support genomic and proteomic research, and scientific
studies.
Swiss-Prot is a curated biological database of protein sequences from
different species created in 1986 by Amos Bairoch during his PhD and
developed by the Swiss Institute of Bioinformatics and the European
Bioinformatics Institute.
Pfam is a large collection of multiple sequence alignments and hidden
Markov models covering many common protein domains and families.
PDB
NCBI
http://proteome.nih.gov/links.html
PubMed – Protein Databases
The Protein database contains sequence data from the translated
coding regions from DNA sequences in GenBank, EMBL, and DDBJ
as well as protein sequences submitted to Protein Information
Resource (PIR), SWISS-PROT, Protein Research Foundation (PRF),
and Protein Data Bank (PDB) (sequences from solved structures).
The Structure database or Molecular Modeling Database (MMDB)
contains experimental data from crystallographic and NMR structure
determinations. The data for MMDB are obtained from the Protein
Data Bank (PDB). The NCBI has cross-linked structural data to
bibliographic information, to the sequence databases, and to the
NCBI taxonomy. Use Cn3D, the NCBI 3D structure viewer, for easy
interactive visualization of molecular structures from Entrez.
Tutorial: http://www.pdb.org/pdbstatic/tutorials/tutorial.html
Example – PDB
http://www.pdb.org
Only proteins with known structures are included.
Example – PDB
Protein Visualization Software
• Cn3D
• RasMol
• TOPS
• Chime
• DSSP
• Molscript
• Ribbons
• MSMS
• Surfnet
• …
PubMed Structure Database
Protein Structure Classification - SCOP
• Structure Classification Of Proteins database
• http://scop.mrc-lmb.cam.ac.uk/scop/
• Hierarchical Clustering
• Family – clear evolutionary relationship
• Superfamily – probable common evolutionary origin
• Fold – major structural similarity
• Boundaries between levels are more or less
subjective
• Conservative evolutionary classification leads to
many new divisions at the family and superfamily
levels, therefore it is recommended to first focus
on higher levels in the classification tree.
Protein Structure Classification - SCOP
Protein Structure Classification - SCOP
• a/a
• a+b
• b/b
• Misc
• a/b
Protein Structure Classification - SCOP
SCOP Classification Statistics
SCOP: Structural Classification of Proteins. 1.69 release
25973 PDB Entries (1 Oct 2004). 70859 Domains. 1 Literature Reference
(excluding nucleic acids and theoretical models)

Class                              Number of folds   Number of superfamilies   Number of families
All alpha proteins                 218               376                       608
All beta proteins                  144               290                       560
Alpha and beta proteins (a/b)      136               222                       629
Alpha and beta proteins (a+b)      279               409                       717
Multi-domain proteins              46                46                        61
Membrane and cell surface proteins 47                88                        99
Small proteins                     75                108                       171
Total                              945               1539                      2845
Protein Structure Classification - CATH
• CATH Protein Structure Classification
• http://www.cathdb.info/latest/index.html
• CATH is a hierarchical classification of protein domain structures, which
clusters proteins at four major levels, Class(C), Architecture(A), Topology(T)
and Homologous superfamily (H).
• Class, derived from secondary structure content, is assigned for
more than 90% of protein structures automatically.
• Architecture, which describes the gross orientation of secondary
structures, independent of connectivities, is currently assigned
manually.
• The topology level clusters structures into fold groups according
to their topological connections and numbers of secondary
structures.
• The homologous superfamilies cluster proteins with highly
similar structures and functions. The assignments of structures
to fold groups and homologous superfamilies are made by
sequence and structure comparisons.
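Because CATH is a strict four-level hierarchy, a domain's classification is commonly written as four dot-separated numbers, one per level (C.A.T.H). The sketch below shows one way to hold such a code and to ask how deep two domains agree. The type and function names are hypothetical, the example codes are only illustrative, and the class names in the lookup table are assumed from the public CATH site rather than taken from these slides.

```python
from typing import NamedTuple, Optional

# Class names as used by CATH (assumed here for illustration).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}
LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


class CathNode(NamedTuple):
    class_id: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathNode:
    """Turn a dotted code such as '3.40.50.720' into its four levels."""
    parts = [int(p) for p in code.split(".")]
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers: " + code)
    return CathNode(*parts)


def deepest_shared_level(a: CathNode, b: CathNode) -> Optional[str]:
    """Name of the deepest level at which two domains still agree."""
    shared = None
    for level, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = level
    return shared


first = parse_cath_code("3.40.50.720")
second = parse_cath_code("3.40.50.300")
print(CATH_CLASSES[first.class_id], "/", deepest_shared_level(first, second))
```

Two domains that agree down to Topology share a fold group but are not asserted to be homologous, which matches the distinction drawn in the bullets above.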
Protein Structure Classification - CATH
CATH vs. SCOP
Protein Fold Space Map
Similarity – DALI score
Distance Matrix
Embedding in 3-D space
(multiple dimensional scaling)
Kim, PNAS, Mar 4, 2003
Review of Protein Structure (5 min)
Introduction to Proteomics (10 min)
Protein Structure Database and Classification
(15 min)
Protein Structure Prediction (15 min)
3-D Alignment (Left for next lab)
Secondary Structure Prediction
AGADIR - An algorithm to predict the helical content of peptides
APSSP - Advanced Protein Secondary Structure Prediction Server
GOR - Garnier et al, 1996
HNN - Hierarchical Neural Network method (Guermeur, 1997)
Jpred - A consensus method for protein secondary structure prediction
at University of Dundee
JUFO - Protein secondary structure prediction from sequence (neural
network)
nnPredict - University of California at San Francisco (UCSF)
Porter - University College Dublin
PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader,
MaxHom, EvalSec from Columbia University
Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
PSA - BioMolecular Engineering Research Center (BMERC) / Boston
PSIpred - Various protein structure prediction methods at Brunel
University
SOPMA - Geourjon and Deléage, 1995
SSpro - Secondary structure prediction using bidirectional recurrent
neural networks at University of California
DLP - Domain linker prediction at RIKEN
http://us.expasy.org/tools/#secondary
Secondary Structure Prediction - HNN
• http://npsa-pbil.ibcp.fr/cgi-bin/secpred_hnn.pl
• >gi|78099986|sp|P0ABK2|CYDB_ECOLI Cytochrome d ubiquinol oxidase subunit 2
(Cytochrome d ubiquinol oxidase subunit II) (Cytochrome bd-I oxidase
subunit II)
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPIILLY
TAWCYWKMFGRITKEDIERNTHSLY
Secondary Structure Prediction - HNN
Sequence length : 379
HNN :
Alpha helix (Hh) : 209 is 55.15%
310 helix (Gg) : 0 is 0.00%
Pi helix (Ii) : 0 is 0.00%
Beta bridge (Bb) : 0 is 0.00%
Extended strand (Ee) : 55 is 14.51%
Beta turn (Tt) : 0 is 0.00%
Bend region (Ss) : 0 is 0.00%
Random coil (Cc) : 115 is 30.34%
Ambigous states (?) : 0 is 0.00%
Other states : 0 is 0.00%
        10        20        30        40        50        60        70
         |         |         |         |         |         |         |
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
ccchhhhhhhhhhhhhhheeeeehccchhcchhhhhheecccccceeeeeeccccccccceeeeeeccch
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhcccccccccchhhhhhhhhhcceeehccchccheehhhhhc
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
hhcccccchhhhheeeeccchhhhhcchceccceeeeeeeeeccchhhhhhhchhhhhhchhhhhhhhhh
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
hhhhhhccceeeeeeccceeeeeccccccccccchhhhhhhhhhhheeccccceeeeccchhhhhhhhhh
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPI
hhhhhhhhhhhhhhhhhhhhhhhhhcchhhcccccccchhhccccchhcccchhhhhhhhhhhhhhhhhh
ILLYTAWCYWKMFGRITKEDIERNTHSLY
hhhhhhhhhhhhhhhcchhhhhhhccccc
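The summary percentages in the HNN report are simply counts of each state letter divided by the sequence length (for example, 209 of 379 residues in helix is 55.15%). A minimal sketch of that bookkeeping, plus a helper that lists contiguous runs of one state, is given below. The function names and the short state string are hypothetical and are not part of the HNN server.

```python
from collections import Counter
from itertools import groupby


def state_composition(states):
    """Map each state letter to (count, percentage of the whole string)."""
    counts = Counter(states)
    total = len(states)
    return {s: (n, 100.0 * n / total) for s, n in counts.items()}


def segments(states):
    """Contiguous runs as (state, start, end), 1-based and inclusive."""
    runs, position = [], 1
    for state, run in groupby(states):
        length = len(list(run))
        runs.append((state, position, position + length - 1))
        position += length
    return runs


# Hypothetical 10-residue prediction: h = helix, e = strand, c = coil.
prediction = "ccchhhhhee"
for state, (count, percent) in sorted(state_composition(prediction).items()):
    print(state, count, "%.2f%%" % percent)
print(segments(prediction))
```

The segment list is often more useful than the totals, since it shows where each predicted helix or strand begins and ends.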
Secondary Structure Prediction - HNN
Motifs Readily Identified from Sequence
• Zinc finger - characteristic order and spacing of cysteine and
histidine residues.
• Leucine zippers – two antiparallel alpha helices held together by
interactions between hydrophobic leucine residues at every
seventh position in each helix.
• Coiled coils – 2-3 helices coiled around each other in a left-handed supercoil (3.5 residues/turn instead of 3.6, i.e. 7 residues per two
turns); the first and fourth positions of each heptad are typically hydrophobic, the others
hydrophilic; 5-10 heptads.
• Transmembrane-spanning proteins – alpha helices comprising
amino acids with hydrophobic side chains, typically 20-30
residues.
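Motifs defined by spacing can be searched for with a plain pattern match. As a rough illustration, the sketch below scans for the leucine-zipper signature described above, a leucine at every seventh position. The function name and the default of four consecutive leucines are assumptions made for this example, and a hit is only a candidate: the pattern says nothing about whether the region is actually helical.

```python
import re


def leucine_heptad_repeats(sequence, min_repeats=4):
    """1-based start positions of L-x(6)-L-x(6)-... runs with min_repeats leucines.

    A lookahead is used so that overlapping candidates are all reported.
    """
    pattern = "(?=(L" + "(?:.{6}L)" * (min_repeats - 1) + "))"
    return [match.start() + 1 for match in re.finditer(pattern, sequence)]


# Hypothetical sequence with leucines at positions 4, 11, 18 and 25.
toy = "AAA" + "LAAAAAA" * 3 + "L"
print(leucine_heptad_repeats(toy))
```

The same approach extends to other spacing-based motifs, for example a cysteine/histidine pattern for zinc fingers, by swapping in a different regular expression.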
Topology Prediction
PSORT - Prediction of protein subcellular localization
TargetP - Prediction of subcellular location
DAS - Prediction of transmembrane regions in prokaryotes using the Dense
Alignment Surface method (Stockholm University)
HMMTOP - Prediction of transmembrane helices and topology of proteins
(Hungarian Academy of Sciences)
PredictProtein - Prediction of transmembrane helix location and topology
(Columbia University)
SOSUI - Prediction of transmembrane regions (Nagoya University, Japan)
TMAP - Transmembrane detection based on multiple sequence alignment
(Karolinska Institut; Sweden)
TMHMM - Prediction of transmembrane helices in proteins (CBS; Denmark)
TMpred - Prediction of transmembrane regions and protein orientation (EMBnet-CH)
TopPred - Topology prediction of membrane proteins (France)
http://us.expasy.org/tools
Transmembrane Helix - TMHMM
• http://www.cbs.dtu.dk/services/TMHMM-2.0/
• >gi|78099986|sp|P0ABK2|CYDB_ECOLI Cytochrome d ubiquinol oxidase subunit 2
(Cytochrome d ubiquinol oxidase subunit II) (Cytochrome bd-I oxidase
subunit II)
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPIILLY
TAWCYWKMFGRITKEDIERNTHSLY
Transmembrane Helix - TMHMM
# gi_78099986_sp_P0ABK2_CYDB_ECOLI Length: 379
# gi_78099986_sp_P0ABK2_CYDB_ECOLI Number of predicted TMHs: 8
# gi_78099986_sp_P0ABK2_CYDB_ECOLI Exp number of AAs in TMHs: 177.07249
# gi_78099986_sp_P0ABK2_CYDB_ECOLI Exp number, first 60 AAs: 20.62396
# gi_78099986_sp_P0ABK2_CYDB_ECOLI Total prob of N-in: 0.94585
# gi_78099986_sp_P0ABK2_CYDB_ECOLI POSSIBLE N-term signal sequence
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 inside 1 6
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 TMhelix 7 24
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 outside 25 76
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 TMhelix 77 99
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 inside 100 122
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 TMhelix 123 145
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 outside 146 159
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 TMhelix 160 182
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 inside 183 202
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 TMhelix 203 225
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 outside 226 261
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 TMhelix 262 281
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 inside 282 292
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 TMhelix 293 315
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 outside 316 334
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 TMhelix 335 357
gi_78099986_sp_P0ABK2_CYDB_ECOLI TMHMM2.0 inside 358 379
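Output in this long format is easy to post-process. The sketch below assumes one whitespace-separated record per line (identifier, program, region type, start, end) with summary lines starting with "#", and pulls out the helix boundaries. The function names and the shortened sample text are hypothetical.

```python
# Shortened, hypothetical sample in the same layout as the TMHMM long output.
SAMPLE = """\
# seq1 Length: 120
# seq1 Number of predicted TMHs: 2
seq1 TMHMM2.0 inside 1 6
seq1 TMHMM2.0 TMhelix 7 24
seq1 TMHMM2.0 outside 25 76
seq1 TMHMM2.0 TMhelix 77 99
seq1 TMHMM2.0 inside 100 120
"""


def parse_regions(text):
    """Return (region_type, start, end) for every non-comment record."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and summary lines
        _seq_id, _program, kind, start, end = line.split()
        regions.append((kind, int(start), int(end)))
    return regions


def helices(regions):
    """Only the predicted transmembrane helices, as (start, end)."""
    return [(start, end) for kind, start, end in regions if kind == "TMhelix"]


tm = helices(parse_regions(SAMPLE))
print(tm)
print([end - start + 1 for start, end in tm])  # helix lengths in residues
```

Checking the helix lengths against the 20-30 residue range expected for a membrane-spanning helix is a quick sanity test on any prediction.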
Tertiary Structure Prediction
Comparative modeling
SWISS-MODEL - An automated knowledge-based protein modelling server
3Djigsaw - Three-dimensional models for proteins based on homologues of
known structure
CPHmodels - Automated neural-network based protein modelling server
ESyPred3D - Automated homology modeling program using neural networks
Geno3d - Automatic modeling of protein three-dimensional structure
SDSC1 - Protein Structure Homology Modeling Server
Threading
3D-PSSM - Protein fold recognition using 1D and 3D sequence profiles
coupled with secondary structure information (Foldfit)
Fugue - Sequence-structure homology recognition
HHpred - Protein homology detection and structure prediction by HMM-HMM
comparison
Libellula - Neural network approach to evaluate fold recognition results
LOOPP - Sequence to sequence, sequence to structure, and structure to
structure alignment
SAM-T02 - HMM-based Protein Structure Prediction
Threader - Protein fold recognition
ProSup - Protein structure superimposition
SWEET - Constructing 3D models of saccharides from their sequences
Ab initio
HMMSTR/Rosetta - Prediction of protein structure from sequence
http://us.expasy.org/tools
Tertiary Structure Prediction
Comparative modeling
3Djigsaw - Three-dimensional models for proteins based on homologues of
known structure
Contreras-Moreira,B., Bates,P.A. (2002)
Domain Fishing: a first step in protein
comparative modelling. Bioinformatics
18: 1141-1142.
Tertiary Structure Prediction
Threading
The term threading was first coined by Jones, Taylor and Thornton
in 1992, and originally referred specifically to the use of a full 3-D
structure atomic representation of the protein template in fold
recognition. Today, the terms threading and fold recognition are
frequently (though somewhat incorrectly) used interchangeably.
The basic idea is that the target sequence (the protein sequence for
which the structure is being predicted) is threaded through the
backbone structures of a collection of template proteins (known as
the fold library) and a “goodness of fit” score calculated for each
sequence-structure alignment. This goodness of fit is often derived
in terms of an empirical energy function, based on statistics derived
from known protein structures, but many other scoring functions
have been proposed and tried over the years.
Threading methods share some of the characteristics of both
comparative modelling methods (the sequence alignment aspect)
and ab initio prediction methods (predicting structure based on
identifying low-energy conformations of the target protein).
http://en.wikipedia.org/wiki/Threading_%28protein_sequence%29
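The idea of scoring a sequence against every template in a fold library can be shown with a deliberately crude toy. In the sketch below each template is reduced to a string of buried (B) or exposed (E) positions, the "goodness of fit" is a simple match count between residue type and environment, and the sequence is slid along each template without gaps. None of this is a real threading potential: the scoring rule, the hydrophobic residue set, and all names are assumptions made only to illustrate the loop structure.

```python
HYDROPHOBIC = set("AVLIMFWC")  # simplistic grouping, assumed for illustration


def fit_score(sequence, environments):
    """+1 for hydrophobic-in-buried or polar-in-exposed, -1 otherwise."""
    score = 0
    for residue, env in zip(sequence, environments):
        buried = env == "B"
        score += 1 if (residue in HYDROPHOBIC) == buried else -1
    return score


def thread(sequence, fold_library):
    """Rank templates by their best ungapped placement of the sequence.

    Returns a list of (template_name, best_score, offset), best first.
    Templates shorter than the sequence are skipped.
    """
    ranking = []
    n = len(sequence)
    for name, environments in fold_library.items():
        if len(environments) < n:
            continue
        best_score, best_offset = max(
            (fit_score(sequence, environments[offset:offset + n]), offset)
            for offset in range(len(environments) - n + 1)
        )
        ranking.append((name, best_score, best_offset))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Hypothetical fold library and target sequence.
library = {"foldA": "EEBBBBEE", "foldB": "BEBEBEBE"}
print(thread("KLLVVD", library))
```

Real threading programs replace the match count with an empirical energy function and allow gaps in the sequence-structure alignment, which is what makes the problem computationally hard.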
Tertiary Structure Prediction
Ab initio (de novo)
• From scratch – using physical properties instead of known
structures
• Mimic folding process – minimize a chosen energy function,
stochastic modeling (e.g., simulated annealing)
• Computationally expensive – requires large clusters, large
machines (e.g., IBM BlueGene) or distributed computing;
currently only works for small peptides
• Big potential in the future – understand the dynamics,
accuracy, and applications in drug development
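Simulated annealing, mentioned above as one stochastic way to minimize an energy function, can be sketched on a toy problem. The example below perturbs one torsion angle at a time and accepts or rejects the move with the Metropolis criterion while the temperature is lowered. The energy function is a made-up stand-in (each angle simply prefers 180 degrees), and the step size, cooling schedule and parameter names are assumptions, not values from any folding program.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical energy: minimal (zero) when every angle is 180 degrees."""
    return sum(1.0 + math.cos(math.radians(a)) for a in angles)


def simulated_annealing(energy, angles, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Return (best_angles, best_energy) found by Metropolis annealing."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    current = list(angles)
    e_current = energy(current)
    best, e_best = list(current), e_current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] = (candidate[i] + rng.uniform(-30.0, 30.0)) % 360.0
        e_candidate = energy(candidate)
        delta = e_candidate - e_current
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-delta / T) so the search can escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best


start = [0.0] * 5
best_angles, best_energy = simulated_annealing(toy_energy, start)
print(toy_energy(start), best_energy)
```

With a realistic energy function each evaluation is far more expensive and the search space grows with chain length, which is why the approach demands the computing resources listed above.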
Tertiary Structure Prediction
Ab initio (de novo)
Prediction Scoring with Rosetta
Rosetta uses a scoring function to judge different
conformations. The process consists of making
'moves' (changing the bond angles of a particular
group of amino acids) and then scoring the new
conformation.
The Rosetta score is a weighted sum of component
scores, where each component score is judging a
different aspect of protein structure.
Environment score: Here, hydrophobic residues are
represented as orange stars, so the left
conformation is good (all the hydrophobics
together) while the rightmost conformation is
bad (with the hydrophobic amino acids not
touching).
Pair-score: Two conformations of a polypeptide are
shown, one (top) where the chain is folded back
on itself bringing two cysteines together
(yellow+yellow = possible disulphide bond) and
forming a salt-bridge (blue+red = opposites
attract). The conformation at bottom does not
make these pairings and the pair-score would,
thus, favor the top conformation.
http://www.grid.org/projects/hpf/howitworks_scoring.htm
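The "weighted sum of component scores" can be written out directly. The sketch below is only an illustration of that arithmetic: the component names mirror the two terms described above, but the weights and score values are invented, and the convention that a lower total is better is an assumption of this example rather than something stated on the slide.

```python
def weighted_score(components, weights):
    """Total score = sum over components of weight * component score."""
    return sum(weights[name] * value for name, value in components.items())


# Hypothetical weights and per-conformation component scores.
weights = {"environment": 1.0, "pair": 0.5}
conformations = {
    "compact": {"environment": -2.0, "pair": -1.0},
    "extended": {"environment": 1.5, "pair": 0.0},
}

totals = {name: weighted_score(parts, weights) for name, parts in conformations.items()}
preferred = min(totals, key=totals.get)  # lower total assumed to be better
print(totals, preferred)
```

After each move the new conformation is scored this way and compared with the previous one, so the weights decide how much each aspect of structure influences the search.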
Evaluation - CASP
CASP - Critical Assessment of Techniques for Protein Structure Prediction, is a
community-wide experiment (though it is commonly referred to as a
competition) for protein structure prediction taking place every two years
since 1994. (http://predictioncenter.org/)
The main goal of CASP is to obtain an in-depth and objective assessment of
our current abilities and inabilities in the area of protein structure
prediction. To this end, participants will predict as much as possible about
a set of soon to be known structures. These will be true predictions, not
‘post-dictions’ made on already known structures. CASP7 will particularly
address the following questions:
1. Are the models produced similar to the corresponding experimental
structure?
2. Is the mapping of the target sequence onto the proposed structure (i.e. the
alignment) correct?
3. Have similar structures that a model can be based on been identified?
4. Are comparative models more accurate than can be obtained by simply
copying the best template?
5. Has there been progress from the earlier CASPs?
6. What methods are most effective?
7. Where can future effort be most productively focused?
Evaluation - CASP
Evaluation of the results is carried out in the following prediction categories:
• tertiary structure prediction (all CASPs)
• secondary structure prediction (dropped after CASP5)
• prediction of structure complexes (CASP2 only; a separate experiment, CAPRI, carries on this subject)
• residue-residue contact prediction (starting CASP4)
• disordered regions prediction (starting CASP5)
• domain boundary prediction (starting CASP6)
• function prediction (starting CASP6)
• model quality assessment (starting CASP7)
• model refinement (starting CASP7)
The tertiary structure prediction category was further subdivided into
• homology modelling
• fold recognition (also called protein threading, although strictly speaking
threading is a method rather than a category)
• de novo structure prediction, now referred to as 'New Fold' because many
methods apply evaluation, or scoring, functions that are biased by
knowledge of native protein structures (an artificial neural network
is one such example).
Evaluation - CASP
Number of human expert groups registered   207
Number of targets released                 104
Number of prediction servers registered    98
Targets canceled                           4
Valid targets                              100
Refinement targets                         9

Prediction format              Number of groups contributing   Number of models designated as 1   Total number of models
3D coordinates                 180                             12393                              48339
Alignments to PDB structures   15                              966                                3896
Residue-residue contacts       17                              1473                               1561
Structural domains assignments 27                              2258                               2515
Disordered regions             19                              1801                               1801
Function prediction            22                              1317                               1930
Quality assessment             29                              2326                               3228
Model refinement               26                              136                                447
All                            255 (unique)                    22670                              63717
Review of Protein Structure (5 min)
Introduction to Proteomics (10 min)
Protein Structure Database and Classification
(15 min)
Protein Structure Prediction (15 min)
3-D Alignment (Left for next lab)
Why Align Structures
1. For homologous proteins (similar ancestry), this
provides the “gold standard” for sequence alignment
– elucidates the common ancestry of the proteins.
2. For nonhomologous proteins, allows us to identify
common substructures of interest.
3. Allows us to classify proteins into clusters, based on
structural similarity.
Sequence/Structure Homology
• The existence of large numbers of remote
homologs shows us that true structural
similarity is hard to see in the primary amino
acid sequence
• Structural conservation is stronger than
sequence conservation
Remote Homology
• Remote homologs sometimes conserve function
(all SH3-like domains bind peptides), and often
conserve active site locations (TIM barrel active
sites are at the ends of the barrels).
• Remote homologs probably are evolutionarily
related and fold using the same folding pathway.
Example of Structural Homologs
4DFR: Dihydrofolate reductase
1YAC: Octameric Hydrolase of Unknown
Specificity
5.9% sequence identity (best alignment)
1YAC structure solved without knowing
function.
Alignment to 4DFR and others implies it is a
hydrolase of some sort.
Example of Structural Homologs
DHFR:
yellow & orange
YAC:
green & purple
Sheets
only
Helices
only
Example of Structural Homologs
Sequence alignment
SLSAAEADLAGKSWAPVFANKNANGLDFLVALFEKFPDSANFFADFK-GKSVADIKA-S
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
PKLRDVSSRIFTRLNEFVNNAANAGKMSAMLSQFAKEHVGFGVGSAQFENVRSMFPGFVA
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
Structural alignment
XSLSAAEADLAGKSW-APVFANKN-ANGLDFLVALFEKFPDSANFF-ADFKGKSVA---DIK
V-LSPADKTNVKAAWGK-VGAHA-GEYGAEALERMFLSFPTTKTYFPHF-------DLS-H
ASPKLRDVSSRIFTRLNEFVNNAANAGKMSA-MLSQ-FAKEHV-GFGVGSAQFENVRSM-F
GSAQVKGHGKKVADALTNAVAHV-D---DMPNAL---SALSDLHAHKLRVDPVNFKLLS-HCL
PGFVA
LVTLAAHLPAEFTP
How to Align Structures
1. Visual inspection (by eye)
2. Computational approach
• Point-based methods use point distances and
other properties to establish correspondences.
• Secondary structure-based methods use vectors
representing secondary structures to establish
correspondences.
Global versus Local
Global alignment
Local Alignment
motif
Structural Alignment Algorithms
Alignment algorithms create a one-to-one mapping
of subset(s) of one sequence to subset(s) of another
sequence.
Structure-based alignment algorithms do this by
minimizing a structure difference score such as the root-mean-square difference (rmsd) in alpha-carbon
positions.
The problem is that we don't know the alignment in advance.
Structure-based alignment programs determine the
alignment that minimizes the rmsd.
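Once a correspondence between residues is fixed and the two structures have been superimposed, the rmsd itself is a short calculation over the paired alpha-carbon coordinates. The sketch below assumes the coordinates are already paired and superimposed, and the function name and toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical alpha-carbon positions in angstroms, already superimposed.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (3.8, 0.5, 0.0), (7.6, 0.0, 1.0)]
print(round(rmsd(model, target), 3))
```

A different choice of which residues are paired gives a different rmsd for the same two structures, which is why finding the alignment is the hard part.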
Evaluating Structural Alignments
• # of aligned residues
• Percent identity in aligned residues
• # of gaps
• Size of two proteins
• Conservation of known active site environments
• RMSD (root mean square deviation) of corresponding residues
• Dihedral angle difference …
• No universal criterion
• Application dependent
Comparing dihedral angles
Torsion angles (φ, ψ) are:
- local by nature (error propagation)
- invariant upon rotation and translation of the
molecule
- compact (O(n) angles for a protein of n residues)
Add 1 degree to all φ, ψ
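A torsion angle is defined by four consecutive atoms, and computing it from coordinates makes the invariance property easy to check. The sketch below uses the standard atan2 formulation. For a residue i, φ is taken over C(i-1), N(i), CA(i), C(i) and ψ over N(i), CA(i), C(i), N(i+1). The helper names and the test points are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees, in (-180, 180], about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Four hypothetical points; moving them all by the same offset
# leaves the angle unchanged.
points = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
shifted = [(x + 5.0, y - 3.0, z + 2.0) for x, y, z in points]
print(dihedral(*points), dihedral(*shifted))
```

Because each angle depends only on four neighbouring atoms, a small error in every φ and ψ accumulates along the chain, which is the error propagation noted above.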
Structural Alignment Methods
• STRUCTAL [Levitt, Subbiah, Gerstein]
  • Using dynamic programming with a distance metric
• DALI [Holm, Sander]
  • Analysis of distance maps
• LOCK [Singh, Brutlag]
  • Analysis of secondary structure vectors,
    followed by refinement with distances
• SSAP [Orengo and Taylor, 1989]
• VAST [Gibrat et al., 1996]
• CE [Shindyalov and Bourne, 1998]
• SSM [Krissinel and Henrick, 2004]
• …
Least Squares Superposition
Problem: find the rotation matrix, R, and a
translation vector, v, that minimize the following quantity:
E(R, v) = \sum_{i=1}^{N} \| R x_i + v - y_i \|^2
where x_i are the coordinates from one molecule
and y_i are the equivalent* coordinates from
another molecule.
*equivalent based on alignment
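For a fixed set of N equivalent pairs, the sum of squared distances between R x_i + v and y_i has a well-known closed-form minimizer, usually attributed to Kabsch. The steps are sketched below as a summary, not as the method used by any particular program named in these slides. The symbols \bar{x}, \bar{y}, H, U, \Sigma, V and d are introduced here for the sketch.

```latex
% 1. Remove the translation by centering both point sets.
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i

% 2. Build the 3x3 covariance matrix and take its singular value decomposition.
H = \sum_{i=1}^{N} (x_i - \bar{x})\,(y_i - \bar{y})^{T} = U \Sigma V^{T}

% 3. Guard against an improper rotation (a reflection).
d = \operatorname{sign}\!\left(\det\!\left(V U^{T}\right)\right)

% 4. Optimal rotation and translation.
R = V \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & d \end{pmatrix} U^{T},
\qquad
v = \bar{y} - R\,\bar{x}
```

The rmsd after superposition is then the square root of the minimized sum divided by N.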
Two Subproblems
• Find correspondence set
• Find alignment transform (protein superposition problem)
• Chicken-and-egg
DALI (Distance ALIgnment)
• DALI has been used to do an ALL vs. ALL comparison
of proteins in the PDB, and to create a hierarchical
clustering of families.
• http://www.ebi.ac.uk/dali/
• FSSP = fold classification based on structure-structure alignment of proteins
• http://ekhidna.biocenter.helsinki.fi/dali/start
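The distance maps that DALI analyses are matrices of intramolecular distances, typically between alpha-carbons. Their appeal is that they do not change when the molecule is rotated or translated, so two structures can be compared without superimposing them first. The sketch below builds such a matrix and checks that property on toy coordinates. It illustrates the representation only, not DALI's scoring or search, and the function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """All pairwise Euclidean distances between (x, y, z) points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def max_abs_difference(m1, m2):
    """Largest element-wise difference between two equally sized matrices."""
    return max(abs(x - y) for row1, row2 in zip(m1, m2) for x, y in zip(row1, row2))


# Hypothetical alpha-carbon coordinates.
original = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
# Rotate 90 degrees about the z axis, then translate.
moved = [(-y + 1.0, x + 2.0, z + 3.0) for x, y, z in original]

print(distance_matrix(original)[0])
print(max_abs_difference(distance_matrix(original), distance_matrix(moved)))
```

Similar substructures in two proteins show up as similar sub-blocks of their distance matrices, which is what makes an all-against-all comparison of the PDB feasible.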
VAST (Vector Alignment Search Tool)
• It places great emphasis on the definition of the threshold of significant
structural similarity. By focusing on similarities that are surprising in the
statistical sense, one does not waste time examining many similarities of small
substructures that occur by chance in protein structure comparison. Very many
of the remaining similarities are examples of remote homology, often
undetectable by sequence comparison. As such they may provide a broader
view of the structure, function and evolution of a protein family.
• At the heart of VAST's significance calculation is definition of the "unit" of
tertiary structure similarity as pairs of secondary structure elements (SSE's)
that have similar type, relative orientation, and connectivity. In comparing two
protein domains the most surprising substructure similarity is that where the
sum of superposition scores across these "units" is greatest. The likelihood
that this similarity would be seen by chance is then given as a simple product:
the probability that one would obtain this score in drawing so many "units" at
random, times the number of alternative SSE-pair combinations possible in the
domain comparison, from which one has chosen the best.
• http://www.ncbi.nlm.nih.gov/Structure/RESEARCH/iucrabs.html#Ref_6