#19 - Protein Structure Basics & 10/5/07 Classification BCB 444/544

BCB 444/544
Lecture 19
A bit of: Protein Structure - Basics
Protein Structure Visualization, Classification & Comparison
#19_Oct05

Required Reading (before lecture)
√ Mon Oct 1 - Lecture 17
Protein Motifs & Domain Prediction
• Chp 7 - pp 85-96
√ Wed Oct 3 - Lecture 18
Protein Structure: Basics (Note change in Lecture Schedule online)
• Chp 12 - pp 173-186
√ Thurs Oct 4 & Fri Oct 5 - Lab 6 & Lecture 19
Protein Structure: Basics, Databases, Visualization, Classification & Comparison
• Chp 13 - pp 187-199
BCB 444/544 F07 ISU Dobbs #19- Protein Structure Basics & Classification
10/5/07

BCB 544 - Extra Required Reading
• For a better idea about what's involved in the Team Projects, please look over last year's expectations for projects: http://www.public.iastate.edu/~f2007.com_s.544/project.htm
BCB 544 Extra Required Reading Assignment: for 544 Extra HW#1 Task 2
Please note: wrong URL (instead of that shown above) was included in originally posted 544ExtraHW#1; corrected version is posted now
• Pollard KS, …., Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link
• Criteria for evaluation of projects (oral presentations) are summarized here: http://www.public.iastate.edu/%7Ef2007.com_s.544/homework/HW7.pdf

Assignments & Announcements - #1
BCB 544 Projects (Optional for BCB 444)
Assigned Mon Sept 24

Assignments & Announcements - #2
Students registered for BCB 444: Two Grading Options
1) Take Final Exam per original Grading Policies
2) Instead of taking Final Exam - you may participate in a Team Research Project
If you choose #2, please do 3 things:
1) Contact Drena (in person)
2) Send email to Michael Terribilini (terrible@iastate.edu)
3) Complete 544 Extra HW#1 - Task 1.1 by noon on Mon Oct 1

BCB 444s (Standard):
Midterm Exams = 100 points each                 200 pts
Homework & Laboratory assignments               200
Final Exam                                      100
Total for BCB 444                               500 pts

BCB 444p (Project):
Midterm Exams = 100 points each                 200 pts
Homework & Laboratory assignments               200
Team Research Project                           190
Total for BCB 444p                              590 pts

BCB 544:
Midterm Exams = 100 points each                 200 pts
Homework & Laboratory assignments               200
Final Exam                                      100
Discussion Questions & Team Research Projects   200
Total for BCB 544                               700 pts
QUESTIONS re: HW#3? Due Mon
Assignments & Announcements #3
ALL: HomeWork #3
Due: Mon Oct 8 by 5 PM
• HW544: HW544Extra #1
√ Due: Task 1.1 - Mon Oct 1 by noon
Due: Task 1.2 & Task 2 - Fri Oct 12 by 5 PM (not Monday)
• 444 "Project-instead-of-Final" students should also submit:
• HW544Extra #1
• Due: Task 1.1 - Mon Oct 8 by noon
• Due: Task 1.2 - Fri Oct 12 by 5 PM ( not Monday)
Task 2 NOT required!
This is a new slide
HMM example from Eddy HMM paper:
Toy HMM for Splice Site Prediction
An HMM for Occasionally Dishonest Casino
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
But, where do you start? "Begin" state not shown
This slide has been changed
Occasionally Dishonest Casino - HW#3
Calculating Different Paths to an
Observed Sequence
Calculations such as those shown below are used to fill a matrix with probability values for every state at every position.
Each path probability is a product of transition probabilities $a_{kl}$ and emission probabilities $e_k(x_i)$.
"Begin" state? 50:50 chance of starting with F vs L die

$$x = x_1, x_2, x_3 = 6, 2, 6$$

Path $\pi^{(1)} = \mathrm{FFF}$:
$$\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227$$

Path $\pi^{(2)} = \mathrm{LLL}$:
$$\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$$

Path $\pi^{(3)} = \mathrm{LFL}$:
$$\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$$
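The three path calculations can be checked with a few lines of Python. This is only a sketch: the names `START`, `TRANS`, `emit` and `path_probability` are made up for illustration, and the loaded-die emission of 0.1 for every face other than 6 is an assumption (the slides only show the values for 2 and 6).

```python
# Casino HMM parameters as given on the slides.
START = {"F": 0.5, "L": 0.5}                  # "Begin" state: 50:50
TRANS = {("F", "F"): 0.99, ("F", "L"): 0.01,  # Fair -> Fair / Loaded
         ("L", "L"): 0.8, ("L", "F"): 0.2}    # Loaded -> Loaded / Fair


def emit(state, roll):
    """Emission probability e_state(roll)."""
    if state == "F":
        return 1 / 6
    # Assumption: the loaded die shows 6 half the time, other faces 0.1 each.
    return 0.5 if roll == 6 else 0.1


def path_probability(rolls, path):
    """Joint probability Pr(x, pi) of the rolls and one state path."""
    prob = START[path[0]] * emit(path[0], rolls[0])
    for i in range(1, len(rolls)):
        prob *= TRANS[(path[i - 1], path[i])] * emit(path[i], rolls[i])
    return prob


for path in ("FFF", "LLL", "LFL"):
    print(path, path_probability([6, 2, 6], path))
```

As on the slide, no transition to an end state is included in the product.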

Viterbi for Calculating Most Probable Path*
* Path within HMM that matches query sequence with highest probability
Calculate optimal path? Construct a matrix of probability values for every state at every residue
How: one way = Viterbi Algorithm
• Initialization (i = 0): $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$
• Recursion (i = 1, ..., L): for each state k, $v_k(i) = e_k(x_i)\, \max_r \big( v_r(i-1)\, a_{rk} \big)$
• Termination: $\Pr(x, \pi^*) = \max_k \big( v_k(L)\, a_{k0} \big)$
To find π*, use trace-back, as in dynamic programming

Viterbi matrix for x = 6, 2, 6 (rows = states, columns = positions):
x:  ε | 6 | 2 | 6
B:  1 | 0 | 0 | 0
F:  0 | (1/6)×(1/2) = 1/12 | (1/6)×max{(1/12)×0.99, (1/4)×0.2} = 0.01375 | (1/6)×max{0.01375×0.99, 0.02×0.2} = 0.00226875
L:  0 | (1/2)×(1/2) = 1/4 | (1/10)×max{(1/12)×0.01, (1/4)×0.8} = 0.02 | (1/2)×max{0.01375×0.01, 0.02×0.8} = 0.008
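A minimal Python sketch of the Viterbi recursion and trace-back for the casino HMM, assuming the parameters from the earlier slides. The names `STATES`, `START`, `TRANS`, `emit` and `viterbi` are illustrative only; the loaded-die emission of 0.1 for faces other than 6 is an assumption; and the termination term a_k0 is left out, which is equivalent to assuming it is the same for both states.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}              # transitions out of "Begin"
TRANS = {"F": {"F": 0.99, "L": 0.01},     # TRANS[r][k] = a_rk
         "L": {"F": 0.2, "L": 0.8}}


def emit(state, roll):
    """Emission probability e_state(roll)."""
    if state == "F":
        return 1 / 6
    # Assumption: loaded die shows 6 with prob 0.5, other faces 0.1 each.
    return 0.5 if roll == 6 else 0.1


def viterbi(rolls):
    """Return (most probable state path, its joint probability)."""
    # First column: leave the Begin state and emit the first roll.
    column = {k: START[k] * emit(k, rolls[0]) for k in STATES}
    pointers = []                         # back-pointers for the trace-back
    for x in rolls[1:]:
        new_column, ptr = {}, {}
        for k in STATES:
            # Best predecessor r maximizes v_r(i-1) * a_rk.
            best = max(STATES, key=lambda r: column[r] * TRANS[r][k])
            new_column[k] = emit(k, x) * column[best] * TRANS[best][k]
            ptr[k] = best
        column = new_column
        pointers.append(ptr)
    # Trace back from the best final state.
    last = max(STATES, key=lambda k: column[k])
    path = [last]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), column[last]


print(viterbi([6, 2, 6]))
```

For the rolls 6, 2, 6 the best path comes out as LLL, with the same probability as the direct path calculation for LLL.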

This slide has been changed
Total Probability
Calculating the Total Probability:
Several different paths can result in observation x
Probability that our model will emit x is:
$$\Pr(x) = \sum_{\pi} \Pr(x, \pi)$$
Note: This is not the same as the matrix on the previous slide! Here, each entry uses a sum (instead of a max) over the previous column, and the entries in the last column are added to give the total

x:  ε | 6 | 2 | 6
B:  1 | 0 | 0 | 0
F:  0 | (1/6)×(1/2) = 1/12 | (1/6)×sum{(1/12)×0.99, (1/4)×0.2} = 0.022083 | (1/6)×sum{0.022083×0.99, 0.020083×0.2} = 0.004313
L:  0 | (1/2)×(1/2) = 1/4 | (1/10)×sum{(1/12)×0.01, (1/4)×0.8} = 0.020083 | (1/2)×sum{0.022083×0.01, 0.020083×0.8} = 0.008144

$$\text{Total probability} = \sum_{\pi} \Pr(x, \pi) = 0 + 0.004313 + 0.008144 \approx 0.012$$
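The total-probability calculation can be sketched the same way. Below, a sum replaces the max of the Viterbi recursion, and a brute-force enumeration over all state paths is included only as a cross-check. The names `forward` and `brute_force` are illustrative, the loaded-die emission of 0.1 for faces other than 6 is an assumption, and no end-state transition is applied.

```python
from itertools import product

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01},
         "L": {"F": 0.2, "L": 0.8}}


def emit(state, roll):
    """Emission probability e_state(roll)."""
    if state == "F":
        return 1 / 6
    # Assumption: loaded die shows 6 with prob 0.5, other faces 0.1 each.
    return 0.5 if roll == 6 else 0.1


def forward(rolls):
    """Total probability Pr(x), summed over all state paths."""
    f = {k: START[k] * emit(k, rolls[0]) for k in STATES}
    for x in rolls[1:]:
        # Same recursion as Viterbi, but with a sum over predecessors.
        f = {k: emit(k, x) * sum(f[r] * TRANS[r][k] for r in STATES)
             for k in STATES}
    return sum(f.values())


def brute_force(rolls):
    """Cross-check: add up Pr(x, pi) for every possible path pi."""
    total = 0.0
    for path in product(STATES, repeat=len(rolls)):
        p = START[path[0]] * emit(path[0], rolls[0])
        for i in range(1, len(rolls)):
            p *= TRANS[path[i - 1]][path[i]] * emit(path[i], rolls[i])
        total += p
    return total


print(forward([6, 2, 6]), brute_force([6, 2, 6]))
```

The brute-force version grows exponentially with sequence length, which is why the matrix (dynamic programming) form is the one used in practice.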

Chp 7 - Protein Motifs & Domain Prediction
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 7
Protein Motifs and Domain Prediction
• √ Identification of Motifs & Domains in MSAs
• √ Motif & Domain Databases Using Regular Expressions
• √ Motif & Domain Databases Using Statistical Models
• Protein Family Databases
• Motif Discovery in Unaligned Sequences
• √ Sequence Logos

A few more Details re: Profiles & HMMs
• Smoothing or "Regularization" - method used to avoid "over-fitting"
• Common problem in machine learning (data-driven) approaches
• Limited training sample size causes over-representation of observed characters while "ignoring" unobserved characters
• Result? Miss members of family not yet sampled (too many false negative hits)
• Pseudocounts - adding artificial values for 'extra' amino acid(s) not observed in the training set
• Treated as 'real' values in calculating probabilities
• Improve predictive power of profiles & HMMs
• Dirichlet mixture - commonly used mathematical model to simulate the aa distribution in a sequence alignment
• To "correct" problems in an observed alignment based on limited number of sequences
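To make the pseudocount idea concrete, here is a small sketch for a single alignment column. It uses the simplest possible scheme, adding the same constant to every amino acid (add-one smoothing), which is a stand-in for illustration and not the Dirichlet mixture mentioned above. The function name `column_probabilities` and the example column are hypothetical, and the column is assumed to contain only the 20 standard residues.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, pseudocount=1.0):
    """Residue probabilities for one MSA column, with pseudocounts.

    Every amino acid gets `pseudocount` artificial observations, so
    residues never seen in the training set keep a non-zero probability.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


# Toy column from five aligned sequences (made-up data).
probs = column_probabilities("LLLIV")
print(probs["L"], probs["W"])
```

Without pseudocounts (`pseudocount=0`), tryptophan would get probability 0 in this column, and any family member with W at that position would be scored as impossible.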

Motifs & Domains
• Motif - short conserved sequence pattern
• Associated with distinct function in protein or DNA
• Avg = 10 residues (usually 6-20 residues)
• e.g., zinc finger motif - in protein
• e.g., TATA box - in DNA
• Domain - "longer" conserved sequence pattern, defined as an independent functional and/or structural unit
• Avg = 100 residues (range from 40-700 in proteins)
• e.g., kinase domain or transmembrane domain - in protein
• Domains may (or may not) include motifs

2 Approaches for Representing "Consensus" Information in Motifs & Domains
• Regular expression - symbolic representation of information from MSA
• e.g., protein phosphorylation site motif: [S,T]- X- [R,K]
• Symbols represent specific or unspecified residues, spaces, etc.
• 2 mechanisms for matching:
• Exact
• "Fuzzy" (inexact, approximate) - flexible, more permissive to detect "near matches"
• Statistical model - includes probability information derived from MSA
• e.g., PSSM, Profile, or HMM
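As an illustration of regular-expression matching, the phosphorylation site motif [S,T]- X- [R,K] can be written as the Python pattern `[ST].[RK]`. The sketch below does exact matching only; the function name `find_motif` and the test sequence are made up for the example, and a lookahead is used so that overlapping sites are also reported.

```python
import re

# [S,T]-X-[R,K]: Ser or Thr, then any residue, then Arg or Lys.
# The lookahead (?=...) lets overlapping occurrences be found too.
PHOSPHO_SITE = re.compile(r"(?=([ST].[RK]))")


def find_motif(sequence, pattern=PHOSPHO_SITE):
    """Return (0-based start, matched residues) for each exact match."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


# Made-up sequence for illustration.
print(find_motif("MASPKLTARGG"))
```

For this toy sequence the matches are SPK at position 2 and TAR at position 6. A "fuzzy" search would relax the pattern (for example by allowing a mismatch), at the cost of more false positives.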

Motif & Domain Databases
Based on regular expressions:
• Prosite (Interpro includes Prosite, PRINTS, etc)
• Emotif
Limitation: these don't take probability info into account
Based on statistical models:
• PRINTS
• BLOCKS
• ProDom
• Pfam
• SMART
• CDART
• Reverse PsiBLAST
• READ your textbook & try some of these at home; there are distinct advantages/disadvantages associated with each
• TAKE HOME LESSON: Always try several methods! (not just one!)

Protein Family Databases
• In addition to databases of "related" protein sequences, based on shared motifs or domains (Pfam, BLOCKS, CDART), some databases "cluster" sequences into families based on near full-length sequence comparisons
• COGs - Clusters of Orthologous Groups (at NCBI)
• Mostly Prokaryotic sequences
• KOG = newer Eukaryotic version
• COGnitor - software to search database
• ProtoNet - also clusters of homologous protein sequences
• Advantages: tree-like hierarchical structure
• Provides GO (gene ontology) annotations
• Provides InterPro keywords

Motif Discovery in Unaligned Sequences
• Expectation Maximization - generate "random" alignment of all sequences, derive PSSM, iteratively match individual sequences to PSSM to edit & improve it
• Problems? Can hit a local optimum (premature convergence)
• Sensitive to initial alignment
• MEME - Multiple EM for Motif Elicitation - modified EM, avoids local optimum issues; two step procedure
• Gibbs Sampling - generate "trial" PSSM from random alignment first, as in EM, but leave one sequence out of initial alignment, then iteratively match PSSM to left-out sequences
• Gibbs Sampler - web-based motif search via Gibbs sampling
• Not mentioned in textbook:
• Stochastic context-free grammars
• Other "state of the art" approaches in recent literature, but not available in web-based servers (yet)

Chp 12 - Protein Structure Basics
SECTION V
STRUCTURAL BIOINFORMATICS
Xiong: Chp 12
Protein Structure Basics
• LAB 6
• Introduction to Protein DataBank - PDB
• PyMol
• Cn3D?

Chp 12 - Protein Structure Basics
SECTION V
STRUCTURAL BIOINFORMATICS
Xiong: Chp 12
Protein Structure Basics
• Amino Acids
• Peptide Bond Formation
• Dihedral Angles
• Hierarchy
• Secondary Structures
• Tertiary Structures
• Determination of Protein 3-Dimensional Structure
• Protein Structure DataBank (PDB)

Protein Structure & Function
• Protein structure - primarily determined by sequence
• Protein function - primarily determined by structure
• Globular proteins: compact hydrophobic core & hydrophilic surface
• Membrane proteins: special hydrophobic surfaces
• Folded proteins are only marginally stable
• Some proteins do not assume a stable "fold" until they bind to something = Intrinsically disordered
Predicting protein structure and function can be very hard -- & fun!

4 Basic Levels of Protein Structure

Primary & Secondary Structure
• Primary
• Linear sequence of amino acids
• Description of covalent bonds linking aa's
• Secondary
• Local spatial arrangement of amino acids
• Description of short-range non-covalent interactions
• Periodic structural patterns: α-helix, β-sheet

Tertiary & Quaternary Structure
• Tertiary
• Overall 3-D "fold" of a single polypeptide chain
• Spatial arrangement of 2° structural elements; packing of these into compact "domains"
• Description of long-range non-covalent interactions (plus disulfide bonds)
• Quaternary
• In proteins with > 1 polypeptide chain, spatial arrangement of subunits

"Additional" Structural Levels
• Super-secondary elements
• Motifs
• Domains
• Foldons

Amino Acids
• Each of 20 different amino acids has a different "R-Group" or side chain attached to Cα

Peptide Bond is Rigid and Planar

Hydrophobic Amino Acids

Charged Amino Acids

Polar Amino Acids

Certain Side-chain Configurations are Energetically Favored (Rotamers)

Ramachandran plot: "Allowable" psi & phi angles
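The phi and psi angles on a Ramachandran plot are dihedral angles, each defined by four consecutive backbone atoms: phi by C(i-1), N(i), Cα(i), C(i), and psi by N(i), Cα(i), C(i), N(i+1). A minimal sketch of the dihedral calculation from four 3-D coordinates follows. The function name `dihedral` and the test points are made up for illustration; in practice the coordinates would come from a PDB file.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (-180..180) defined by four points."""
    u1, u2, u3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(u1, u2)   # normal of the plane through p0, p1, p2
    n2 = _cross(u2, u3)   # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(u2, u2)) * _dot(u1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (not real atoms): cis, trans and a quarter turn.
print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0)))   # cis
print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0)))  # trans
print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1)))   # quarter turn
```

Computing phi and psi for every residue of a structure and plotting psi against phi gives the Ramachandran plot.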

Glycine is Smallest Amino Acid
R group = H atom
• Glycine residues increase backbone flexibility because their R group is only an H atom

Proline is Cyclic
• Proline residues reduce flexibility of polypeptide chain
• Proline cis-trans isomerization is often a rate-limiting step in protein folding
• Recent work suggests it may also regulate ligand binding in native proteins
Andreotti (BBMB)

Cysteines can Form Disulfide (S-S) Bonds
• Disulfide bonds (covalent) stabilize 3-D structures
• In eukaryotes, disulfide bonds are often found in secreted proteins or extracellular domains

Globular Proteins Have a Compact Hydrophobic Core
• Packing of hydrophobic side chains into interior is main driving force for folding
• Problem? Polypeptide backbone is highly polar (hydrophilic) due to polar -NH and C=O in each peptide unit (these carry partial charges at neutral pH ≈ 7, as found in biological systems); these polar groups must be neutralized
• Solution? Form regular secondary structures, e.g., α-helix, β-sheet, stabilized by H-bonds

Exterior Surface of Globular Proteins is Generally Hydrophilic
• Hydrophobic core formed by packed secondary structural elements provides compact, stable core
• "Functional groups" of protein are attached to this framework; exterior has more flexible regions (loops) and polar/charged residues
• Hydrophobic "patches" on protein surface are often involved in protein-protein interactions

Protein Secondary Structures
• α-Helices
• β-Sheets
• Loops
• Coils

α-Helix: Stabilized by H-bonds between every ~4th residue in Backbone
C = black, O = red, N = blue, H = white
Look! - Charges on backbone are "neutralized" by hydrogen bonds (H-bonds) - red fuzzy vertical bonds

Certain Amino Acids are "Preferred" & Others are Rare in α-Helices
• Ala, Glu, Leu, Met = good helix formers
• Pro, Gly, Tyr, Ser = very poor
• Amino acid composition & distribution varies, depending on location of helix in 3-D structure

β-Sheets - also Stabilized by H-bonds Between Backbone Atoms
Anti-parallel
Parallel

Loops
• Connect helices and sheets
• Vary in length and 3-D configurations
• Are located on surface of structure
• Are more "tolerant" of mutations
• Are more flexible and can adopt multiple conformations
• Tend to have charged and polar amino acids
• Are frequently components of active sites
• Some fall into distinct structural families (e.g., hairpin loops, reverse turns)

Coils
• Regions of 2° structure that are not helices, sheets, or recognizable turns
• Intrinsically disordered regions appear to play important functional roles

Chp 13 - Protein Structure Basics
SECTION V
STRUCTURAL BIOINFORMATICS
Xiong: Chp 13
Protein Structure Visualization, Comparison & Classification
• Protein Structural Visualization
• Protein Structure Comparison
• Protein Structure Classification