BCB 444/544 Lecture 19 #19_Oct05 Protein Structure - Basics

BCB 444/544
Lecture 19
A bit of: Protein Structure - Basics
Protein Structure Visualization,
Classification & Comparison
#19_Oct05
BCB 444/544 F07 ISU Dobbs #19- Protein Structure Basics & Classification
10/5/07
1
Required Reading
(before lecture)
√Mon Oct 1 - Lecture 17
Protein Motifs & Domain Prediction
• Chp 7 - pp 85-96
√ Wed Oct 3 - Lecture 18
Protein Structure: Basics (Note chg in Lecture Schedule online )
• Chp 12 - pp 173-186
√Thurs Oct 4 & Fri Oct 5 - Lab 6 & Lecture 19
Protein Structure: Basics, Databases, Visualization,
Classification & Comparison
• Chp 13 - pp 187-199
BCB 544 - Extra Required Reading
Assigned Mon Sept 24
BCB 544 Extra Required Reading Assignment:
for 544 Extra HW#1 Task 2
• Pollard KS, …., Haussler D. (2006) An RNA gene expressed during
cortical development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html
doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link
BCB 544 Projects (Optional for BCB 444)
• For a better idea about what's involved in the Team
Projects, please look over last year's expectations for
projects: http://www.public.iastate.edu/~f2007.com_s.544/project.htm
Please note: wrong URL (instead of that shown above) was included
in originally posted 544ExtraHW#1; corrected version is posted now
• Criteria for evaluation of projects (oral presentations) are
summarized here:
http://www.public.iastate.edu/%7Ef2007.com_s.544/homework/HW7.pdf
Assignments & Announcements - #1
Students registered for BCB 444: Two Grading Options
1) Take Final Exam per original Grading Policies
2) Instead of taking Final Exam - you may participate
in a Team Research Project
If you choose #2, please do 3 things:
1) Contact Drena (in person)
2) Send email to Michael Terribilini (terrible@iastate.edu)
3) Complete 544 Extra HW#1 - Task 1.1 by noon on Mon Oct 1
Assignments & Announcements - #2
BCB 444s (Standard):
  200 pts | Midterm Exams = 100 points each
  200     | Homework & Laboratory assignments = 200 points
  100     | Final Exam
  500 pts | Total for BCB 444
BCB 444p (Project):
  200 pts | Midterm Exams = 100 points each
  200     | Homework & Laboratory assignments = 200 points
  190     | Team Research Project
  590 pts | Total for BCB 444p
BCB 544:
  200 pts | Midterm Exams = 100 points each
  200     | Homework & Laboratory assignments
  100     | Final Exam
  200     | Discussion Questions & Team Research Projects
  700 pts | Total for BCB 544
Assignments & Announcements #3
ALL: HomeWork #3
Due: Mon Oct 8 by 5 PM
• HW544: HW544Extra #1
√Due: Task 1.1 - Mon Oct 1 by noon
Due: Task 1.2 & Task 2 - Fri Oct 12 by 5 PM (not Monday)
• 444 "Project-instead-of-Final" students should also submit:
• HW544Extra #1
• Due: Task 1.1 - Mon Oct 8 by noon
• Due: Task 1.2 - Fri Oct 12 by 5 PM (not Monday)
Task 2 NOT required!
QUESTIONS re: HW#3? Due Mon
This is a new slide
HMM example from Eddy HMM paper:
Toy HMM for Splice Site Prediction
An HMM for Occasionally Dishonest Casino
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
But, where do you start? "Begin" state not shown
Occasionally Dishonest Casino - HW#3
"Begin" state? 50:50 chance of starting with F vs L die
This slide has been changed
Calculating Different Paths to an
Observed Sequence
Calculations such as those shown below are used to fill a matrix
with probability values for every state at every position
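Observed sequence: $x = x_1, x_2, x_3 = 6, 2, 6$
Three of the possible paths: $\pi^{(1)} = FFF$, $\pi^{(2)} = LLL$, $\pi^{(3)} = LFL$
In each product, the $a$ terms are transition probabilities and the $e$ terms are emission probabilities:
$\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227$
$\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$
$\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$
(The end transition $a_{k0}$ is not included in any of these numerical products.)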
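The three path calculations above can be checked with a short Python sketch. The parameter values are the ones used on these slides; the assumption that the loaded die emits 1 through 5 with probability 0.1 each follows from e_L(6) = 0.5 and e_L(2) = 0.1, and the names `START`, `TRANS`, `EMIT`, and `joint_probability` are illustrative only.

```python
# Occasionally dishonest casino HMM (parameter values as used on the slides).
START = {"F": 0.5, "L": 0.5}                        # a_0k: 50:50 begin state
TRANS = {("F", "F"): 0.99, ("F", "L"): 0.01,        # a_rk: transition probabilities
         ("L", "L"): 0.80, ("L", "F"): 0.20}
EMIT = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},     # fair die
    # Loaded die: 6 with prob. 0.5; assumed 0.1 for each of 1..5
    "L": {**{roll: 0.1 for roll in range(1, 6)}, 6: 0.5},
}


def joint_probability(rolls, path):
    """Pr(x, pi): product of transition and emission terms along one path."""
    prob = START[path[0]] * EMIT[path[0]][rolls[0]]
    for prev, cur, roll in zip(path, path[1:], rolls[1:]):
        prob *= TRANS[(prev, cur)] * EMIT[cur][roll]
    return prob


rolls = [6, 2, 6]
for path in ("FFF", "LLL", "LFL"):
    print(path, joint_probability(rolls, path))
```

Of these three paths, LLL has the highest joint probability for the rolls 6, 2, 6.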
Calculate optimal path? Construct a matrix of
probability values for every state at every residue
How: one way = Viterbi Algorithm
• Initialization (i = 0)
$v_0(0) = 1, \quad v_k(0) = 0 \text{ for } k > 0$
• Recursion (i = 1, . . . , L):
For each state k
$v_k(i) = e_k(x_i)\, \max_r \{ v_r(i-1)\, a_{rk} \}$
• Termination:
$\Pr(x, \pi^*) = \max_k \{ v_k(L)\, a_{k0} \}$
To find $\pi^*$, use trace-back, as in dynamic programming
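A minimal Python sketch of the recursion and trace-back, using the casino parameters from these slides. The function name `viterbi` and the dictionary layout are illustrative choices, the loaded die is assumed to emit 1 through 5 with probability 0.1 each, and the end transition a_k0 is taken as 1 (it is not applied here).

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}                                   # a_0k
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.20, "L": 0.80}}  # a_rk
EMIT = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "L": {**{roll: 0.1 for roll in range(1, 6)}, 6: 0.5},      # assumed 0.1 for 1..5
}


def viterbi(rolls, states, start, trans, emit):
    """Return (probability, path) for the most probable state path."""
    # v[i][k]: probability of the best path that ends in state k at position i
    v = [{k: start[k] * emit[k][rolls[0]] for k in states}]
    back = []  # back[i - 1][k]: best predecessor of state k at position i
    for roll in rolls[1:]:
        col, ptr = {}, {}
        for k in states:
            best_r = max(states, key=lambda r: v[-1][r] * trans[r][k])
            col[k] = emit[k][roll] * v[-1][best_r] * trans[best_r][k]
            ptr[k] = best_r
        v.append(col)
        back.append(ptr)
    # Termination (a_k0 taken as 1), then trace-back through the pointers
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return v[-1][last], "".join(reversed(path))


prob, path = viterbi([6, 2, 6], STATES, START, TRANS, EMIT)
print(path, prob)
```

For long sequences the products underflow, so practical implementations usually work with log probabilities and replace products by sums; this sketch keeps plain probabilities so the values can be compared with hand calculations.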
Viterbi for Calculating Most Probable Path*
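* Path within HMM that matches query sequence with highest probability
Observed sequence x = 6, 2, 6. Rows are states (B = begin, F = fair, L = loaded); columns are positions.
State | i = 0 | x1 = 6 | x2 = 2 | x3 = 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)(1/2) = 1/12 | (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375 | (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
L | 0 | (1/2)(1/2) = 1/4 | (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02 | (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008
$v_k(i) = e_k(x_i)\, \max_r \{ v_r(i-1)\, a_{rk} \}$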
Total Probability
Several different paths can result in observation x
Probability that our model will emit x is:
$\Pr(x) = \sum_{\pi} \Pr(x, \pi)$
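For a short sequence, the total probability can be checked by brute force: enumerate every possible state path and add up the joint probabilities. The sketch below does this for the rolls 6, 2, 6 with the casino parameters from these slides (loaded die assumed to emit 1 through 5 with probability 0.1 each; function names are illustrative). The number of paths grows exponentially with sequence length, so this is only practical for very short sequences.

```python
from itertools import product

START = {"F": 0.5, "L": 0.5}
TRANS = {("F", "F"): 0.99, ("F", "L"): 0.01,
         ("L", "L"): 0.80, ("L", "F"): 0.20}
EMIT = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "L": {**{roll: 0.1 for roll in range(1, 6)}, 6: 0.5},  # assumed 0.1 for 1..5
}


def joint_probability(rolls, path):
    """Pr(x, pi) for one state path."""
    prob = START[path[0]] * EMIT[path[0]][rolls[0]]
    for prev, cur, roll in zip(path, path[1:], rolls[1:]):
        prob *= TRANS[(prev, cur)] * EMIT[cur][roll]
    return prob


def total_probability_brute_force(rolls):
    """Pr(x): sum of Pr(x, pi) over all 2^L state paths."""
    return sum(joint_probability(rolls, path)
               for path in product("FL", repeat=len(rolls)))


print(total_probability_brute_force([6, 2, 6]))
```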
This slide has been changed
Calculating the Total Probability:
Note: This is not the same as the matrix on the previous slide!
Here, last column contains sums for each row
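Observed sequence x = 6, 2, 6. Rows are states (B = begin, F = fair, L = loaded); columns are positions.
State | i = 0 | x1 = 6 | x2 = 2 | x3 = 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)(1/2) = 1/12 | (1/6) sum{(1/12)(0.99), (1/4)(0.2)} = 0.022083 | (1/6) sum{(0.022083)(0.99), (0.020083)(0.2)} = 0.004313
L | 0 | (1/2)(1/2) = 1/4 | (1/10) sum{(1/12)(0.01), (1/4)(0.8)} = 0.020083 | (1/2) sum{(0.022083)(0.01), (0.020083)(0.8)} = 0.008144
Total probability = $\sum_{\pi} \Pr(x, \pi)$ = 0 + 0.004313 + 0.008144 ≈ 0.012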
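The matrix above corresponds to what is usually called the forward algorithm: the Viterbi recursion with max replaced by a sum. A minimal Python sketch under the same assumptions as the slides (casino parameters, loaded die assumed to emit 1 through 5 with probability 0.1 each, end transition taken as 1; the function name `forward` is illustrative):

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.20, "L": 0.80}}
EMIT = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "L": {**{roll: 0.1 for roll in range(1, 6)}, 6: 0.5},  # assumed 0.1 for 1..5
}


def forward(rolls, states, start, trans, emit):
    """Return (Pr(x), last column), summing over all paths column by column."""
    # f[k]: probability of emitting the rolls so far and ending in state k
    f = {k: start[k] * emit[k][rolls[0]] for k in states}
    for roll in rolls[1:]:
        f = {k: emit[k][roll] * sum(f[r] * trans[r][k] for r in states)
             for k in states}
    return sum(f.values()), f


total, last_column = forward([6, 2, 6], STATES, START, TRANS, EMIT)
print(total, last_column)
```

Unlike enumerating every path, this takes time proportional to (sequence length) × (number of states)², which is why the matrix is filled one column at a time.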
A few more Details re: Profiles & HMMs
• Smoothing or "Regularization" - method used to avoid "over-fitting"
• Common problem in machine learning (data-driven) approaches
• Limited training sample size causes over-representation of observed
characters while "ignoring" unobserved characters
• Result? Miss members of family not yet sampled
(too many false negative hits)
• Pseudocounts - adding artificial values for 'extra' amino acid(s) not
observed in the training set
• Treated as 'real' values in calculating probabilities
• Improve predictive power of profiles & HMMs
• Dirichlet mixture - commonly used mathematical model to simulate
the aa distribution in a sequence alignment
• To "correct" problems in an observed alignment based on limited
number of sequences
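A small Python sketch of the pseudocount idea for a single alignment column. It uses a uniform add-one pseudocount, which is an assumption made here for simplicity and is much cruder than a Dirichlet mixture; the column `"LLLLIV"` and the function name are hypothetical, and the column is assumed to contain no gap characters.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, pseudocount=1.0):
    """Residue probabilities for one alignment column (no gaps assumed).

    Each of the 20 amino acids receives `pseudocount` extra observations,
    so residues never seen in the training alignment keep a small
    non-zero probability.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


column = "LLLLIV"                                   # hypothetical training column
raw = column_probabilities(column, pseudocount=0.0)
smoothed = column_probabilities(column, pseudocount=1.0)
print(raw["M"], smoothed["M"], smoothed["L"])
```

Without the pseudocount, a family member with Met at this position would score zero for the column; with it, Met is only penalized relative to Leu.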
Chp 7 - Protein Motifs & Domain Prediction
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 7
Protein Motifs and Domain Prediction
• √Identification of Motifs & Domains in MSAs
• √Motif & Domain Databases Using Regular Expressions
• √Motif & Domain Databases Using Statistical Models
• Protein Family Databases
• Motif Discovery in Unaligned Sequences
• √Sequence Logos
Motifs & Domains
• Motif - short conserved sequence pattern
• Associated with distinct function in protein or DNA
• Avg = 10 residues (usually 6-20 residues)
• e.g., zinc finger motif - in protein
• e.g., TATA box - in DNA
• Domain - "longer" conserved sequence pattern, defined
as an independent functional and/or structural unit
• Avg = 100 residues (range from 40-700 in proteins)
• e.g., kinase domain or transmembrane domain - in protein
• Domains may (or may not) include motifs
2 Approaches for Representing "Consensus"
Information in Motifs & Domains
• Regular expression - symbolic representation of
information from MSA
• e.g., protein phosphorylation site motif: [S,T]- X- [R,K]
• Symbols represent specific or unspecified residues, spaces, etc.
• 2 mechanisms for matching:
• Exact
• "Fuzzy" (inexact, approximate) - flexible, more permissive
to detect "near matches"
• Statistical model - includes probability information
derived from MSA
• e.g., PSSM, Profile, or HMM
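The regular-expression approach can be sketched with Python's standard `re` module. The pattern below is an exact-match translation of the [S,T]-X-[R,K] example; the test sequence `"MSARTPKQ"` is made up for illustration, and the lookahead form is used so that overlapping matches are also reported.

```python
import re

# [S,T]-X-[R,K]: Ser or Thr, then any one residue, then Arg or Lys.
# Wrapping the pattern in a lookahead lets overlapping sites be found too.
MOTIF = re.compile(r"(?=([ST].[RK]))")


def find_motif(sequence):
    """Return (1-based start position, matched residues) for each site."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]


print(find_motif("MSARTPKQ"))  # hypothetical sequence
```

An exact pattern like this reports a site or nothing; it cannot say that one match is more likely than another, which is the limitation that statistical models (PSSM, profile, HMM) address.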
Motif & Domain Databases
Based on regular expressions:
• Prosite (Interpro includes Prosite, PRINTS, etc)
• Emotif
Limitation: these don't take probability info into account
Based on statistical models:
• PRINTS
• BLOCKS
• ProDom
• Pfam
• SMART
• CDART
• Reverse PsiBLAST
• READ your textbook & try some
of these at home; there are
distinct advantages/disadvantages
associated with each
• TAKE HOME LESSON:
Always try several methods!
(not just one!)
Protein Family Databases
• In addition to databases of "related" protein sequences, based on
shared motifs or domains (Pfam, BLOCKS, CDART), some databases
"cluster" sequences into families based on near full-length sequence
comparisons
• COGs - Clusters of Orthologous Groups (at NCBI)
• Mostly Prokaryotic sequences
• KOG = newer Eukaryotic version
• COGnitor - software to search database
• ProtoNet - also clusters of homologous protein sequences
• Advantages: tree-like hierarchical structure
• Provide GO (gene ontology) annotations
• Provides InterPro keywords
Motif Discovery in Unaligned Sequences
Expectation Maximization - generate "random" alignment of all
sequences, derive PSSM, iteratively match individual sequences to
PSSM to edit & improve it
Problems? Can hit a local optimum (premature convergence)
Sensitive to initial alignment
• MEME - Multiple EM for Motif Elicitation - modified EM, avoids
local optimum issues; two step procedure
Gibbs Sampling - generate "trial" PSSM from random alignment
first, as in EM, but leave one sequence out of initial alignment, then
iteratively match PSSM to left-out sequences
• Gibbs Sampler - web-based motif search via Gibbs sampling
• Not mentioned in textbook:
• Stochastic context-free grammars
• Other "state of the art" approaches in recent literature, but not
available in web-based servers (yet)
Chp 12 - Protein Structure Basics
SECTION V
STRUCTURAL BIOINFORMATICS
Xiong: Chp 12
Protein Structure Basics
• LAB 6
• Introduction to Protein DataBank - PDB
• PyMol
• Cn3D?
Chp 12 - Protein Structure Basics
SECTION V
STRUCTURAL BIOINFORMATICS
Xiong: Chp 12
Protein Structure Basics
• Amino Acids
• Peptide Bond Formation
• Dihedral Angles
• Hierarchy
• Secondary Structures
• Tertiary Structures
• Determination of Protein 3-Dimensional Structure
• Protein Structure DataBank (PDB)
Protein Structure & Function
• Protein structure - primarily determined by sequence
• Protein function - primarily determined by structure
• Globular proteins: compact hydrophobic core &
hydrophilic surface
• Membrane proteins: special hydrophobic surfaces
• Folded proteins are only marginally stable
• Some proteins do not assume a stable "fold" until they bind to
something = Intrinsically disordered
→ Predicting protein structure and function can be very hard -- & fun!
4 Basic Levels of Protein Structure
Primary & Secondary Structure
• Primary
• Linear sequence of amino acids
• Description of covalent bonds linking aa’s
• Secondary
• Local spatial arrangement of amino acids
• Description of short-range non-covalent interactions
• Periodic structural patterns: α-helix, β-sheet
Tertiary & Quaternary Structure
• Tertiary
• Overall 3-D "fold" of a single polypeptide chain
• Spatial arrangement of 2° structural elements; packing of
these into compact "domains"
• Description of long-range non-covalent interactions
(plus disulfide bonds)
• Quaternary
• In proteins with > 1 polypeptide chain, spatial
arrangement of subunits
"Additional" Structural Levels
• Super-secondary elements
• Motifs
• Domains
• Foldons
Amino Acids
• Each of 20 different amino acids has a different "R-Group" or
side chain attached to Cα
Peptide Bond is Rigid and Planar
Hydrophobic Amino Acids
Charged Amino Acids
Polar Amino Acids
Certain Side-chain Configurations are
Energetically Favored (Rotamers)
Ramachandran plot:
"Allowable" psi & phi angles
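Each point on a Ramachandran plot is a (phi, psi) pair, and each of those is a dihedral angle defined by four consecutive backbone atoms: phi by C(i-1), N(i), Cα(i), C(i), and psi by N(i), Cα(i), C(i), N(i+1). A minimal Python sketch of the dihedral calculation from four 3-D coordinates follows; the helper names and the test coordinates are illustrative, not taken from any structure file.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees) about the p1-p2 bond, in (-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)             # unit vector along central bond
    # Components of the outer bonds perpendicular to the central bond
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Made-up coordinates: four points arranged with a 90 degree twist
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applied to every residue of a structure, this produces the phi and psi values that are plotted against each other on the Ramachandran plot.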
Glycine is Smallest Amino Acid
R group = H atom
• Glycine residues increase
backbone flexibility because
they have no R group
Proline is Cyclic
• Proline residues
reduce flexibility of
polypeptide chain
• Proline cis-trans
isomerization is often a
rate-limiting step in
protein folding
• Recent work suggests
it may also
regulate ligand binding
in native proteins
Andreotti (BBMB)
Cysteines can Form Disulfide (S-S) Bonds
• Disulfide bonds
(covalent) stabilize
3-D structures
• In eukaryotes,
disulfide bonds are
often found in secreted
proteins or
extracellular domains
Globular Proteins Have a Compact
Hydrophobic Core
• Packing of hydrophobic side chains into interior is main
driving force for folding
• Problem? Polypeptide backbone is highly polar
(hydrophilic) due to the polar -NH and C=O groups in each
peptide unit (these carry partial charges at the neutral
pH ~7 found in biological systems); these polar groups
must be neutralized
• Solution? Form regular secondary structures,
• e.g., α-helix, β-sheet, stabilized by H-bonds
Exterior Surface of Globular Proteins is
Generally Hydrophilic
• Hydrophobic core formed by packed secondary
structural elements provides compact, stable core
• "Functional groups" of protein are attached to this
framework; exterior has more flexible regions (loops)
and polar/charged residues
• Hydrophobic "patches" on protein surface are often
involved in protein-protein interactions
Protein Secondary Structures
• Helices
• bSheets
• Loops
• Coils
Helix: Stabilized by H-bonds between
every ~ 4th residue in Backbone
C = black
O = red
N = blue
H = white
Look! - Charges on backbone are "neutralized"
by hydrogen bonds (H-bonds) - red fuzzy vertical bonds
Certain Amino Acids are "Preferred" &
Others are Rare in Helices
• Ala, Glu, Leu, Met = good helix formers
• Pro, Gly, Tyr, Ser = very poor
• Amino acid composition & distribution varies,
depending on location of helix in 3-D structure
β-Sheets - also Stabilized by H-bonds
Between Backbone Atoms
Anti-parallel
Parallel
Loops
• Connect helices and sheets
• Vary in length and 3-D configurations
• Are located on surface of structure
• Are more "tolerant" of mutations
• Are more flexible and can adopt
multiple conformations
• Tend to have charged and polar amino
acids
• Are frequently components of active
sites
• Some fall into distinct structural
families (e.g., hairpin loops,
reverse turns)
Coils
• Regions of 2° structure that are not
helices, sheets, or recognizable turns
• Intrinsically disordered regions appear to
play important functional roles
Chp 13 - Protein Structure Basics
SECTION V
STRUCTURAL BIOINFORMATICS
Xiong: Chp 13
Protein Structure Visualization, Comparison &
Classification
• Protein Structural Visualization
• Protein Structure Comparison
• Protein Structure Classification