#18 - Protein Motifs & Domains 10/3/07 BCB 444/544

BCB 444/544
Lecture 18
More details: HMMs
Protein Motifs & Domain Prediction
Maybe: Protein Structure - The Basics
#18_Oct03
BCB 444/544 F07 ISU Dobbs #18- Protein Motifs & Domains
10/3/07

Required Reading
(before lecture)
√Mon Oct 1 - Lecture 17
Protein Motifs & Domain Prediction
• Chp 7 - pp 85-96
Wed Oct 3 - Lecture 18
Protein Structure: The Basics (Note chg in lecture Schedule!)
• Chp 12 - pp 173-186
Thurs Oct 4 - Lab 6
Protein Structure: Databases & Visualization
Fri Oct 5 - Lecture 19
Protein Structure: Classification & Comparison
• Chp 13 - pp 187-199
Assignments & Announcements
• HW544Extra #1 √Due: Task 1.1 - Mon Oct 1 (today) by noon
Task 1.2 & Task 2 - Mon Oct 8 by 5 PM
• HomeWork #3 - posted online
Due: Mon Oct 8 by 5 PM

BCB 544 - Extra Required Reading
Mon Sept 24
BCB 544 Extra Required Reading Assignment:
• Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS,
Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M Jr,
Vanderhaeghen P, Haussler D. (2006) An RNA gene expressed during cortical
development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html
doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link
A few Online Resources for:
Cell & Molecular Biology
• NCBI Science Primer: What is a cell?
• http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
• NCBI Science Primer: What is a genome?
• http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
• BioTech’s Life Science Dictionary
• http://biotech.icmb.utexas.edu/search/dict-search.html
• NCBI Bookshelf
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=books

Statistics References
Statistical Inference (Hardcover)
George Casella, Roger L. Berger
StatWeb: A Guide to Basic Statistics for Biologists
http://www.dur.ac.uk/stat.web/
Basic Statistics:
http://www.statsoft.com/textbook/stbasic.html
(correlations, tests, frequencies, etc.)
Electronic Statistics Textbook: StatSoft
http://www.statsoft.com/textbook/stathome.html
(from basic statistics to ANOVA to discriminant analysis, clustering,
regression, data mining, machine learning, etc.)
Extra Credit Questions #2-#6:
2. What is the size of the dystrophin gene (in kb)?
Is it still the largest known human protein?
3. What is the largest protein encoded in human genome (i.e.,
longest single polypeptide chain)?
4. What is the largest protein complex for which a structure is
known (for any organism)?
5. What is the most abundant protein (naturally occurring) on
earth?
6. Which state in the US has the largest number of mobile
genetic elements (transposons) in its living population?
For 1 pt total (0.2 pt each): Answer all questions correctly
& submit by to terrible@iastate.edu
For 2 pts total: Prepare a PPT slide with all correct answers
& submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Choose one option - you can't earn 3 pts!
• Partial credit for incorrect answers? only if they are truly amusing!

Extra Credit Questions #7 & #8:
Given that each male attending our BCB 444/544 class on a typical
day is healthy (let's assume MH=7), and is generating sperm at a
rate equal to the average normal rate for reproductively
competent males (dSp/dT = ? per minute):
7a. How many rounds of meiosis will occur during our 50 minute class
period?
7b. How many total sperm will be produced by our BCB 444/544 class
during that class period?
8. How many rounds of meiosis will occur in the reproductively
competent females in our class? (assume FH=5)
For 0.6 pts total (0.2 pt each): Answer all questions correctly
& submit by to terrible@iastate.edu
For 1 pts total: Prepare a PPT slide with all correct answers
& submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Choose one option - you can't earn more than 1 pt for this!
• Partial credit for incorrect answers? only if they are truly amusing!

Answers?
Chp 6 - Profiles & Hidden Markov Models
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 6
Profiles & HMMs
• Position Specific Scoring Matrices (PSSMs)
• PSI-BLAST
• Profiles
• Markov Models & Hidden Markov Models

Statistical Models for Representing
Biological Sequences
3 types of probabilistic models, all of which:
• Are based on MSA
• Capture both observed frequencies & predicted frequencies of
unobserved characters
In order of "sensitivity":
1. PSSM - scoring table derived from an ungapped MSA; stores
frequencies (log odds scores) for each amino acid in each position
of a protein sequence
2. Profile - a PSSM with gaps: based on gapped MSA with
penalties for insertions & deletions
3. HMM - hidden Markov Model - more complex mathematical model
(than PSSMs or Profiles) because it also differentiates between
insertions and deletions

HMMs for Biological Sequences?
• HMMs originally developed for speech recognition
• Now widely used in bioinformatics
• Many applications (motif/domain detection, sequence
alignment, phylogenetic ...)
HMMs are "machine learning" algorithms - must be
"trained" to obtain optimal statistical parameters
• For biological sequences:
• each character of a sequence is considered a state in
a Markov process
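A PSSM as described on the "Statistical Models" slide (a log-odds scoring table derived from an ungapped MSA) might be sketched as follows. This is only an illustration under stated assumptions: the function names `build_pssm` and `score`, the toy DNA alignment (used instead of 20 amino acids to keep the table small), the uniform background frequencies, and the pseudocount of 1 (added so unobserved residues do not give log(0)) are all choices made here, not taken from the lecture.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, background=None, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    Assumptions (hypothetical, for illustration): uniform background
    frequencies unless given, and a fixed pseudocount per residue.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("an ungapped MSA needs rows of equal length")
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}

    pssm = []
    total = len(alignment) + pseudocount * len(alphabet)
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        column_scores = {}
        for a in alphabet:
            freq = (counts[a] + pseudocount) / total  # smoothed frequency
            column_scores[a] = math.log2(freq / background[a])
        pssm.append(column_scores)
    return pssm


def score(pssm, seq):
    """Sum the position-specific log-odds scores of seq (same length as PSSM)."""
    return sum(column[res] for column, res in zip(pssm, seq))


toy_msa = ["ACGT", "ACGA", "ACTT", "GCGT"]  # toy ungapped alignment
pssm = build_pssm(toy_msa, "ACGT")
print(round(score(pssm, "ACGT"), 3), round(score(pssm, "TTAA"), 3))
```

A sequence resembling the alignment scores higher than an unrelated one; a profile would extend the same table with position-specific gap penalties.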
But, What is a Markov Model?
Markov Model (or Markov chain)
= mathematical model used to describe a sequence of
events that occur one after another in a chain
= a process that moves in one direction from one state to
the next with a certain transition probability
For biological sequences:
• each letter = state
• linked together by transition probabilities

Different Types of Markov Models
Zero-order Markov Model: probability of current state
is independent of previous state(s)
e.g., random sequence, each residue with equal frequency
First-order MM: probability of current state is
determined by the previous state
e.g., frequencies of two linked residues (dimer) occurring
simultaneously
Second-order MM: describes situation in which
probability of current state is determined by the
previous two states
e.g., frequencies of three linked residues (trimers) occurring simultaneously, as in a codon
Higher orders? Also possible, later…
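The difference between a zero-order and a first-order Markov model can be made concrete by estimating both from one sequence. The sketch below is a minimal illustration; the function names and the toy sequence are assumptions made here, not part of the lecture.

```python
from collections import Counter, defaultdict


def zero_order_frequencies(seq):
    """Zero-order model: each residue's probability ignores its neighbors."""
    counts = Counter(seq)
    return {state: n / len(seq) for state, n in counts.items()}


def first_order_transitions(seq):
    """First-order model: estimate P(next | current) from dimer counts."""
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(seq, seq[1:]):
        pair_counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in pair_counts.items()
    }


seq = "ACGCGCGTACGCG"  # toy sequence (assumption)
p0 = zero_order_frequencies(seq)
p1 = first_order_transitions(seq)
print(p0)
print(p1)
```

A second-order model would condition on the previous two residues in the same way, counting trimers instead of dimers.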
So, What is a hidden Markov Model?
Hidden Markov Model (HMM)
- a more sophisticated model in which some of the states
are hidden
- some "unobserved" factors influence the state
transition probabilities
- MM which combines 2 or more Markov chains:
• only 1 chain is made up of observed states
• other chains are made up of unobserved or "hidden"
states

Hidden Markov Models - HMMs
Goal: Find most likely explanation for observed variables
Components:
• States - composed of a number of elements or "symbols" (e.g.,
A,C,G,T)
• Observed variables - sequence (or outcome) we can "see"
• Hidden variables - insertions/deletions/transition probabilities
that can't be "seen"
• Emission probability - probability value associated with each
"symbol" in each state
• Transition probability - probability of going from one state to
another
• Special graphical representation used to illustrate
relationships
This is a new slide
An HMM for CpG Islands?
HMM example from Eddy HMM paper:
Toy HMM for Splice Site Prediction
Emission probabilities are 0 or 1, e.g., $e_{G-}(G) = 1$, $e_{G-}(T) = 0$
See Durbin et al., Biological Sequence Analysis, Cambridge, 1998
This slide has been changed
An HMM for Occasionally Dishonest Casino
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2

Calculating Different Paths to an
Observed Sequence
Calculations such as those shown below are used to fill a matrix
with probability values for every state at every position
($a$ = transition probability, $e$ = emission probability)
$x = x_1, x_2, x_3 = 6, 2, 6$
$\pi^{(1)} = FFF$
$\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \frac{1}{6} \times 0.99 \times \frac{1}{6} \times 0.99 \times \frac{1}{6} \approx 0.00227$
$\pi^{(2)} = LLL$
$\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$
$\pi^{(3)} = LFL$
$\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \frac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$
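The path calculations for the occasionally dishonest casino can be checked with a few lines of code. This is a sketch, not the lecture's own implementation: the dictionary names and `joint_probability` are made up here, and the loaded die is assumed to emit 1-5 with probability 0.1 each and 6 with probability 0.5 (the slide only shows the values for rolls 2 and 6). The start probabilities of 0.5 and the self-transitions 0.99 and 0.8 are read off the products on the slide.

```python
# Model parameters; names are hypothetical, values follow the slide
# (loaded-die emissions for rolls 1, 3, 4, 5 are an assumption).
START = {"F": 0.5, "L": 0.5}
TRANS = {
    "F": {"F": 0.99, "L": 0.01},
    "L": {"F": 0.2, "L": 0.8},
}
EMIT = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "L": {**{roll: 0.1 for roll in range(1, 6)}, 6: 0.5},
}


def joint_probability(rolls, path):
    """Pr(x, pi): multiply start, emission and transition terms along a path."""
    prob = START[path[0]] * EMIT[path[0]][rolls[0]]
    for i in range(1, len(rolls)):
        prob *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][rolls[i]]
    return prob


rolls = [6, 2, 6]
for path in ("FFF", "LLL", "LFL"):
    print(path, joint_probability(rolls, path))
```

Enumerating every path this way grows exponentially with sequence length, which is why dynamic programming is used to fill the matrix instead.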
This slide has been changed
Calculating the Most Probable Path*, using
Viterbi algorithm (using traceback as in DP)
* Path within HMM that matches query sequence with highest probability
$v_k(i) = e_k(x_i) \max_r \left( v_r(i-1)\, a_{rk} \right)$
| π \ x | ε | 6 | 2 | 6 |
|---|---|---|---|---|
| B | 1 | 0 | 0 | 0 |
| F | 0 | (1/6)×(1/2) = 1/12 | (1/6)×max{(1/12)×0.99, (1/4)×0.2} = 0.01375 | (1/6)×max{0.01375×0.99, 0.02×0.2} = 0.00226875 |
| L | 0 | (1/2)×(1/2) = 1/4 | (1/10)×max{(1/12)×0.01, (1/4)×0.8} = 0.02 | (1/2)×max{0.01375×0.01, 0.02×0.8} = 0.008 |

This slide has been changed
Calculating the Total Probability:
Note: This is not the same as the matrix on the previous slide!
Here, last column contains sums for each row
| π \ x | ε | 6 | 2 | 6 |
|---|---|---|---|---|
| B | 1 | 0 | 0 | 0 |
| F | 0 | (1/6)×(1/2) = 1/12 | (1/6)×sum{(1/12)×0.99, (1/4)×0.2} = 0.022083 | (1/6)×sum{0.022083×0.99, 0.020083×0.2} = 0.004313 |
| L | 0 | (1/2)×(1/2) = 1/4 | (1/10)×sum{(1/12)×0.01, (1/4)×0.8} = 0.020083 | (1/2)×sum{0.022083×0.01, 0.020083×0.8} = 0.008144 |
Total probability = $\sum_{\pi} \Pr(x, \pi) = 0 + 0.004313 + 0.008144 = 0.012$

This slide has been changed
Estimating the Probabilities
or “Training” the HMM
• Calculate frequencies in each column of MSA built from set of
related sequences
• Use frequency values to fill the emission and transition
probabilities in the model (use two matrices for this)
• Viterbi training
• Derive probable paths for training data using Viterbi algorithm
• Re-estimate transition probabilities based on Viterbi path
• Iterate until paths stop changing

Profile HMMs
• Used to model a family of related sequences
(or motif or domain)
• Derived from a MSA of family members
• Transition & emission probabilities are position-specific
• Set parameters of model so that total probability peaks at members
of family
• Sequences can be tested for family membership using
Viterbi algorithm to evaluate match against profile
• Other algorithms can be used
• e.g., "forward" & "backward" algorithms
• (see text - or see Wikipedia re: HMMs)
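The Viterbi and total-probability matrices for the casino example (rolls 6, 2, 6) can be reproduced with a short sketch. The recurrence for the most probable path is the one on the slide, $v_k(i) = e_k(x_i) \max_r (v_r(i-1)\, a_{rk})$; the total probability uses the same recurrence with the max replaced by a sum (the "forward" algorithm). The names below are hypothetical, and the loaded die is assumed to emit 1-5 with probability 0.1 each and 6 with probability 0.5.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "L": {**{roll: 0.1 for roll in range(1, 6)}, 6: 0.5},  # assumed values
}


def viterbi(rolls):
    """Return (most probable state path, its joint probability)."""
    v = [{k: START[k] * EMIT[k][rolls[0]] for k in STATES}]
    backpointers = []
    for x in rolls[1:]:
        row, ptr = {}, {}
        for k in STATES:
            # Best previous state r maximizes v_r(i-1) * a_rk.
            best = max(STATES, key=lambda r: v[-1][r] * TRANS[r][k])
            row[k] = EMIT[k][x] * v[-1][best] * TRANS[best][k]
            ptr[k] = best
        v.append(row)
        backpointers.append(ptr)
    # Traceback from the best final state, as in dynamic programming.
    last = max(STATES, key=lambda k: v[-1][k])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[-1][last]


def forward(rolls):
    """Return the total probability of the rolls, summed over all paths."""
    f = {k: START[k] * EMIT[k][rolls[0]] for k in STATES}
    for x in rolls[1:]:
        f = {
            k: EMIT[k][x] * sum(f[r] * TRANS[r][k] for r in STATES)
            for k in STATES
        }
    return sum(f.values())


best_path, best_prob = viterbi([6, 2, 6])
total_prob = forward([6, 2, 6])
print(best_path, best_prob, total_prob)
```

For these rolls the sketch picks the path LLL with probability 0.008, and the total probability comes out near 0.0125. Real implementations usually work with log probabilities to avoid underflow on long sequences.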
Profile HMM represents a gapped MSA
Character in alignment can
be in one of 3 states:
Match - observed
Insert - hidden
Delete - hidden
Hidden chains
Observed chain

Applications (of PSSMs, Profiles, HMMs)
• Psi-BLAST - you've heard enough about this!
• Uses Profiles (not actually PSSMs) - iteratively
• HMMer - for building & using HMMs
• developed by Sean Eddy's group
• Not a web-based server; must download the software
• 9 related programs
• but check out the site - it's fun!
• In previous lab: used SuperFam (HMMs)
• http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/
• Prosite - includes patterns (regular expressions) & profiles
for motifs & domains
• http://ca.expasy.org/prosite
• Pfam (MSAs & HMMs)
• http://pfam.sanger.ac.uk/ (new URL)
• Many others

A few more Details re: Profiles & HMMs
• Smoothing or "Regularization" - method used to avoid "over-fitting"
• Common problem in machine learning (data-driven) approaches
• Limited training sample size causes over-representation of observed
characters while "ignoring" unobserved characters
• Result? Miss members of family not yet sampled
(too many false negative hits)
• Pseudocounts - adding artificial values for 'extra' amino acid(s) not
observed in the training set
• Treated as 'real' values in calculating probabilities
• Improve predictive power of profiles & HMMs
• Dirichlet mixture - commonly used mathematical model to simulate
the aa distribution in a sequence alignment
• To "correct" problems in an observed alignment based on limited
number of sequences

Chp 7 - Protein Motifs & Domain Prediction
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 7
Protein Motifs and Domain Prediction
• Identification of Motifs & Domains in MSAs
• Motif & Domain Databases Using Regular Expressions
• Motif & Domain Databases Using Statistical Models
• Protein Family Databases
• Motif Discovery in Unaligned Sequences
• √Sequence Logos

Motifs & Domains
• Motif - short conserved sequence pattern
• Associated with distinct function in protein or DNA
• Avg = 10 residues (usually 6-20 residues)
• e.g., zinc finger motif - in protein
• e.g., TATA box - in DNA
• Domain - "longer" conserved sequence pattern, defined
as an independent functional and/or structural unit
• Avg = 100 residues (range from 40-700 in proteins)
• e.g., kinase domain or transmembrane domain - in protein
• Domains may (or may not) include motifs

This slide has been changed
Example: Pfam: Protein Families
http://pfam.sanger.ac.uk/
• “A comprehensive collection of protein domains and families,
with a range of well-established uses including genome
annotation.”
• Pfam: clans, web tools and services: R.D. Finn, …A. Bateman (2006)
Nucleic Acids Res Database Issue 34:D247-D5
• Each family is represented by:
• 2 MSAs
• 2 Hidden Markov Models (profile-HMMs)
• cf. Superfamily - from Lab 5
• similar collection of curated MSAs & HMMs, focuses on
superfamily level
2 Approaches for Representing "Consensus"
Information in Motifs & Domains
• Regular expression - reduce information from MSA
• e.g., protein phosphorylation site motif: [S,T]- X- [R,K]
• Symbols represent specific or unspecified residues, spaces,
etc.
• 2 mechanisms for matching:
• Exact
• "Fuzzy" (inexact, approximate) - flexible, more permissive
to detect "near matches"
• Statistical model - includes probability information
derived from MSA
• e.g., PSSM, Profile or HMM

Motif & Domain Databases
Based on regular expressions:
• Prosite (Interpro)
• Emotif
Limitation: these don't take probability info into account
Based on statistical models:
• PRINTS
• BLOCKS
• ProDom
• Pfam
• SMART
• CDART
• Reverse PsiBLAST
• READ your textbook & try some
of these at home; there are
distinct advantages/disadvantages
associated with each
• TAKE HOME LESSON:
Always try several methods!
(not just one!)
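Exact matching with a regular-expression motif, such as the phosphorylation site pattern [S,T]- X- [R,K] from the slide, might be sketched as follows. The helper names `motif_to_regex` and `find_motif` and the toy protein sequence are assumptions made for this illustration; the conversion only handles the simple bracket and X syntax shown on the slide, not the full Prosite pattern language, and it does not implement "fuzzy" matching.

```python
import re


def motif_to_regex(pattern):
    """Convert a slide-style motif like '[S,T]- X- [R,K]' to a Python regex."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        if element.upper() == "X":
            parts.append(".")  # X = any residue
        elif element.startswith("[") and element.endswith("]"):
            # [S,T] = any one of the listed residues
            parts.append("[" + element[1:-1].replace(",", "") + "]")
        else:
            parts.append(re.escape(element))  # a single fixed residue
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (1-based position, matched residues) for every exact match.

    A lookahead is used so that overlapping matches are also reported.
    """
    regex = re.compile("(?=(" + motif_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


hits = find_motif("[S,T]- X- [R,K]", "MSARTGKAAT")  # toy sequence
print(motif_to_regex("[S,T]- X- [R,K]"), hits)
```

Every match counts equally here, which illustrates the limitation noted on the slide: a regular expression carries no probability information, unlike a PSSM, profile or HMM.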
Chp 12 - Protein Structure Basics
SECTION V
STRUCTURAL BIOINFORMATICS
Xiong: Chp 12
Protein Structure Basics
• Introduction to the Protein DataBank - PDB
• NEXT lecture!