#17 - Protein Motifs & Domain 10/1/07 Prediction HMMs

#17 - Protein Motifs & Domain
Prediction
10/1/07
BCB 444/544
Lecture 17
Finish HMMs
Protein Motifs & Domain Prediction
#17_Oct01

Required Reading (before lecture)
Mon Oct 1 - Lecture 17
Protein Motifs & Domain Prediction
• Chp 7 - pp 85-96
Wed Oct 3 - Lecture 18
Protein Structure: The Basics (Note chg in lecture Schedule!)
• Chp 12 - pp 173-186
Thurs Oct 4 - Lab 6
Protein Structure: Databases & Visualization
Fri Oct 5 - Lecture 19
Protein Structure: Classification & Comparison
• Chp 13 - pp 187-199
BCB 444/544 F07 ISU Dobbs #17- Protein Motifs & Domain Prediction
10/1/07
Assignments & Announcements
• HW544Extra #1 Due: Task 1.1 - Mon Oct 1 (today) by noon
Task 1.2 & Task 2 - Mon Oct 8 by 5 PM
• HomeWork #3 - posted online
Due: Mon Oct 8 by 5 PM

BCB 544 - Extra Required Reading
Mon Sept 24
BCB 544 Extra Required Reading Assignment:
• Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS,
Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M Jr,
Vanderhaeghen P, Haussler D. (2006) An RNA gene expressed during cortical
development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html
doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link
A few Online Resources for:
Cell & Molecular Biology
• NCBI Science Primer: What is a cell?
• http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
• NCBI Science Primer: What is a genome?
• http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
• BioTech’s Life Science Dictionary
• http://biotech.icmb.utexas.edu/search/dict-search.html
• NCBI Bookshelf
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=books

Statistics References
Statistical Inference (Hardcover)
George Casella, Roger L. Berger
StatWeb: A Guide to Basic Statistics for Biologists
http://www.dur.ac.uk/stat.web/
Basic Statistics:
http://www.statsoft.com/textbook/stbasic.html
(correlations, tests, frequencies, etc.)
Electronic Statistics Textbook: StatSoft
http://www.statsoft.com/textbook/stathome.html
(from basic statistics to ANOVA to discriminant analysis, clustering,
regression, data mining, machine learning, etc.)
Extra Credit Questions #2-#6:
2. What is the size of the dystrophin gene (in kb)?
Is it still the largest known human protein?
3. What is the largest protein encoded in human genome (i.e.,
longest single polypeptide chain)?
4. What is the largest protein complex for which a structure is
known (for any organism)?
5. What is the most abundant protein (naturally occurring) on
earth?
6. Which state in the US has the largest number of mobile
genetic elements (transposons) in its living population?
For 1 pt total (0.2 pt each): Answer all questions correctly
For 2 pts total: Prepare a PPT slide with all correct answers
• Choose one option - you can't earn 3 pts!
& submit by to terrible@iastate.edu
& submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Partial credit for incorrect answers? only if they are truly amusing!

Extra Credit Questions #7 & #8:
Given that each male attending our BCB 444/544 class on a typical
day is healthy (let's assume MH=7), and is generating sperm at a
rate equal to the average normal rate for reproductively
competent males (dSp /dT = ? per minute):
7a. How many rounds of meiosis will occur during our 50 minute class
period?
7b. How many total sperm will be produced by our BCB 444/544 class
during that class period?
8. How many rounds of meiosis will occur in the reproductively
competent females in our class? (assume FH=5)
For 0.6 pts total (0.2 pt each): Answer all questions correctly
For 1 pt total: Prepare a PPT slide with all correct answers
• Choose one option - you can't earn more than 1 pt for this!
& submit by to terrible@iastate.edu
& submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Partial credit for incorrect answers? only if they are truly amusing!

Answers?
Chp 6 - Profiles & Hidden Markov Models
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 6
Profiles & HMMs
• √ Position Specific Scoring Matrices (PSSMs)
• √ PSI-BLAST
TODAY:
• Profiles
• Markov Models & Hidden Markov Models
Sequence Motifs (Patterns)
Types of representations:
• √ Consensus Sequences
• √ Sequence Logos
• √ PSSMs - Position-Specific Scoring Matrices
• √ Profiles
• HMMs - Hidden Markov Models

Statistical Models for Representing
Biological Sequences
3 types of probabilistic models, all of which:
• Are based on MSA
• Capture both observed frequencies & predicted frequencies of
unobserved characters
In order of "sensitivity":
1. PSSM - scoring table derived from an ungapped MSA; stores
frequencies (log odds scores) for each amino acid in each position
of a protein sequence
2. Profile - a PSSM with gaps: based on gapped MSA with
penalties for insertions & deletions
3. HMM - hidden Markov Model - more complex mathematical model
(than PSSMs or Profiles) because it also differentiates between
insertions and deletions
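As a rough illustration of the PSSM idea (a log-odds table built from an ungapped MSA), here is a minimal Python sketch. It is not the course's or any tool's implementation: the function names `build_pssm` and `score`, the DNA alphabet, the uniform background, and the add-one pseudocount are all assumptions made for the example. The pseudocount is one simple way to give unobserved characters a non-zero predicted frequency.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet="ACGT", background=None, pseudocount=1.0):
    """Build a log2-odds PSSM from an ungapped alignment (equal-length strings)."""
    ncols = len(alignment[0])
    if any(len(seq) != ncols for seq in alignment):
        raise ValueError("ungapped alignment: all sequences must have equal length")
    if background is None:
        # Assumption: uniform background frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    total = len(alignment) + pseudocount * len(alphabet)
    pssm = []
    for col in range(ncols):
        counts = Counter(seq[col] for seq in alignment)
        # Pseudocounts keep unobserved characters from getting probability zero.
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm

def score(pssm, seq):
    """Sum the position-specific log-odds scores along a sequence."""
    return sum(column[ch] for column, ch in zip(pssm, seq))

if __name__ == "__main__":
    msa = ["ACGT", "ACGA", "ACGT", "TCGT"]
    pssm = build_pssm(msa)
    print(score(pssm, "ACGT"), score(pssm, "TTTT"))
```

A sequence resembling the alignment scores higher than an unrelated one; a profile or profile HMM extends this by also modelling gaps.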
HMM example: CpG Islands
(Written CpG to distinguish from a C•G base pair)
Nucleotide frequencies in human genome (%):
A     C     T     G
29.5  20.4  29.6  20.5
• CpG dinucleotides are rarer than would be expected
from independent probabilities of C and G (given the
background frequencies in human genome)
• High CpG frequency is sometimes biologically
significant; e.g., sometimes associated with promoter
regions (“start sites” for genes)
• CpG island - a region where CpG dinucleotides are
much more abundant than elsewhere
How can we represent or model CpG islands?
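Before modelling, the CpG depletion described above can be checked numerically. The sketch below computes an observed/expected CpG ratio for a DNA string, where "expected" assumes C and G occur independently at their observed frequencies. The function name `cpg_observed_expected` is hypothetical, and this simple ratio is only one common way to quantify CpG content, not necessarily the one used in the lecture.

```python
def cpg_observed_expected(seq: str) -> float:
    """Observed CpG count divided by the count expected if C and G were independent."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    c_count = seq.count("C")
    g_count = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the true dinucleotide count.
    cpg_count = seq.count("CG")
    expected = c_count * g_count / n
    return cpg_count / expected if expected > 0 else 0.0

if __name__ == "__main__":
    print(cpg_observed_expected("CGCGCGCG"))  # CpG-rich toy sequence
    print(cpg_observed_expected("CCCCGGGG"))  # same composition, few CpGs
```

A ratio well below 1 means CpG is under-represented relative to the base composition; CpG islands are regions where the ratio is much higher than in the surrounding sequence.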
Hidden Markov Models - HMMs
Goal: Find most likely explanation for observed variables
Components:
• Observed variables
• Hidden variables
• Emitted symbols
• Emission probabilities
• Transition probabilities
• Graphical representation to illustrate relationships among these

Different Types of Markov Models
Zero-order Markov Model: probability of current state
is independent of previous state(s)
e.g., random sequence, each residue with equal frequency
First-order MM: probability of current state is
determined by the previous state
e.g., frequencies of two linked residues (dimer) occurring
simultaneously
Second-order MM: describes situation in which
probability of current state is determined by the
previous two states
e.g., frequencies of three linked residues (trimers) occurring simultaneously, as in a codon
Higher orders? Also possible, later…
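To make the first-order case concrete, the following sketch estimates transition probabilities from dinucleotide counts and uses them to score a sequence. The names `train_first_order` and `log_prob`, the add-one pseudocount, and the uniform initial distribution in the usage example are assumptions for illustration only.

```python
import math

def train_first_order(seqs, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(current | previous) from dinucleotide counts in training sequences."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for seq in seqs:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }

def log_prob(seq, trans, initial):
    """Log-probability of seq: first symbol from `initial`, the rest from transitions."""
    lp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp

if __name__ == "__main__":
    trans = train_first_order(["CGCGCG"])
    uniform = {a: 0.25 for a in "ACGT"}
    print(trans["C"]["G"], log_prob("CGCG", trans, uniform))
```

A zero-order model would drop the dependence on `prev` and use a single frequency table; a second-order model would condition on the previous two symbols instead.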
But, What is a Markov Model?
Markov Model (or Markov chain)
= mathematical model used to describe a sequence of
events that occur one after another in a chain
= a process that moves in one direction from one state
to the next with a certain transition probability
For biological sequences:
• each letter = state
• linked together by transition probabilities

So, What is a hidden Markov Model?
Hidden Markov Model (HMM)
- a more sophisticated model in which some of the states
are hidden
- some "unobserved" factors influence the state
transition probabilities
- MM which combines 2 or more Markov chains:
  • only 1 chain is made up of observed states
  • other chains are made up of unobserved or "hidden"
  states
Hidden Markov Models - HMMs
Goal: Find most likely explanation for observed variables
Components:
• States - composed of a number of elements or "symbols" (e.g.,
A,C,G,T)
• Observed variables - sequence (or outcome) we can "see"
• Hidden variables - insertions/deletions/transition probabilities
that can't be "seen"
• Emission probability - probability value associated with each
"symbol" in each state
• Transition probability - probability of going from one state to
another
• Special graphical representation used to illustrate
relationships

HMMs for Biological Sequences?
• HMMs originally developed for speech recognition
• Now widely used in bioinformatics
• Many applications (motif/domain detection, sequence
alignment, phylogenetic
• For Biological sequences:
  • each character of a sequence is considered a state in
  a Markov process
HMMs are "machine learning" algorithms - must be
"trained" to obtain optimal statistical parameters
HMM example from Eddy HMM paper:
Toy HMM for Splice Site Prediction
The Occasionally Dishonest Casino
A casino uses a fair die most of the time, but occasionally
switches to a "loaded" one
• Fair die:
Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
• Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = ½
• These are emission probabilities
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
• Transitions between states obey a Markov process
a linear chain of events linked by probability values such that the
occurrence of one event (state) depends on the occurrence of
previous event(s) or state(s)
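The casino model above can be written down directly as two probability tables and sampled from. This is a minimal sketch: Prob(Fair → Fair) = 0.99 and Prob(Loaded → Loaded) = 0.8 are taken as the complements of the stated transition probabilities, and the function name `simulate_casino` and the choice to start with the fair die are assumptions for the example.

```python
import random

# Transition probabilities; the self-transitions are the complements of the stated values.
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
# Emission probabilities for the fair (F) and loaded (L) die.
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {**{face: 1 / 10 for face in range(1, 6)}, 6: 1 / 2},
}

def simulate_casino(n, start="F", rng=None):
    """Return (hidden path, observed rolls) for n tosses of the dishonest casino."""
    rng = rng or random.Random()
    state = start  # Assumption: the casino starts with the fair die.
    path, rolls = [], []
    for _ in range(n):
        path.append(state)
        faces = list(EMIT[state])
        rolls.append(rng.choices(faces, weights=[EMIT[state][f] for f in faces])[0])
        nxt = list(TRANS[state])
        state = rng.choices(nxt, weights=[TRANS[state][s] for s in nxt])[0]
    return "".join(path), rolls

if __name__ == "__main__":
    path, rolls = simulate_casino(40, rng=random.Random(0))
    print(path)
    print("".join(str(r) for r in rolls))
```

The first printed line is the hidden state sequence (which die was used); the second is what an observer at the table would actually see.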
An HMM for Occasionally Dishonest Casino
The Occasionally Dishonest Casino
• Known:
  • Structure of the model
  • Transition probabilities
• Hidden: What casino actually did
  • FFFFFLLLLLLLFFFF...
• Observable: Series of die tosses
  • 3415256664666153...
• What we must infer:
  • When was a fair die used?
  • When was a loaded one used?
  • Answer is a sequence
  FFFFFFFLLLLLLFFF...
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
HMM: Making the Inference
• Model assigns a probability to each explanation for the
observation, e.g.:
$P(326 \mid FFL) = P(3 \mid F) \cdot P(F \to F) \cdot P(2 \mid F) \cdot P(F \to L) \cdot P(6 \mid L) = \frac{1}{6} \cdot 0.99 \cdot \frac{1}{6} \cdot 0.01 \cdot \frac{1}{2}$
• Maximum Likelihood: Determine which explanation is most likely
  • Find path most likely to have produced observed sequence
• Total Probability: Determine probability that observed sequence
was produced by HMM
  • Consider all paths that could have produced observed sequence

HMM Notation
• x = sequence of symbols emitted by model
  • $x_i$ = symbol emitted at time i
• π = path, a sequence of states
  • i-th state in π is $\pi_i$
• $a_{kr}$ = transition probability, for making a transition
from state k to state r
$a_{kr} = \Pr(\pi_i = r \mid \pi_{i-1} = k)$
• $e_k(b)$ = emission probability, that symbol b is emitted
when in state k
$e_k(b) = \Pr(x_i = b \mid \pi_i = k)$

Calculating Different Paths to an
Observed Sequence
$x = x_1, x_2, x_3 = 6, 2, 6$
$\pi^{(1)} = FFF$
$\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \frac{1}{6} \times 0.99 \times \frac{1}{6} \times 0.99 \times \frac{1}{6} \approx 0.00227$
$\pi^{(2)} = LLL$
$\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$
$\pi^{(3)} = LFL$
$\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \frac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$

Identifying the Most Probable Path?
The most likely path π* satisfies:
$\pi^* = \arg\max_{\pi} \Pr(x, \pi)$
To find π*, consider all possible ways the last "symbol"
of x could have been emitted
Let $v_k(i)$ = Prob. of path $\pi_1, \ldots, \pi_i$ most likely
to emit $x_1, \ldots, x_i$ such that $\pi_i = k$
Then
$v_k(i) = e_k(x_i) \max_r \left( v_r(i-1)\, a_{rk} \right)$

Calculate optimal path? Construct a matrix of
probability values for every state at every residue
How: one way = Viterbi Algorithm
• Initialization (i = 0)
$v_0(0) = 1, \quad v_k(0) = 0 \text{ for } k > 0$
• Recursion (i = 1, . . . , L ):
For each state k
$v_k(i) = e_k(x_i) \max_r \left( v_r(i-1)\, a_{rk} \right)$
• Termination:
$\Pr(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$
To find π*, use trace-back, as in dynamic programming

Viterbi for Most Probable Path: Example
π \ x | ε | 6 | 2 | 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)×(1/2) = 1/12 | (1/6)×max{(1/12)×0.99, (1/4)×0.2} = 0.01375 | (1/6)×max{0.01375×0.99, 0.02×0.2} = 0.00226875
L | 0 | (1/2)×(1/2) = 1/4 | (1/10)×max{(1/12)×0.01, (1/4)×0.8} = 0.02 | (1/2)×max{0.01375×0.01, 0.02×0.8} = 0.008
$v_k(i) = e_k(x_i) \max_r \left( v_r(i-1)\, a_{rk} \right)$
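The path probabilities and the Viterbi recursion can be reproduced with a short script. This is a sketch under stated assumptions rather than a reference implementation: start probabilities of 0.5 for each die follow the worked example (a_0F = a_0L = 0.5), the end transition a_k0 is treated as 1, and the names `joint_probability` and `viterbi` are hypothetical.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # a_0F and a_0L from the worked example
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {**{face: 1 / 10 for face in range(1, 6)}, 6: 1 / 2},
}

def joint_probability(x, path):
    """Pr(x, path): product of start, emission and transition terms along one path."""
    p = START[path[0]] * EMIT[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][x[i]]
    return p

def viterbi(x):
    """Return (most probable state path, its joint probability with x)."""
    v = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    backpointers = []
    for symbol in x[1:]:
        column, pointers = {}, {}
        for k in STATES:
            # Best previous state r maximises v_r(i-1) * a_rk.
            best = max(STATES, key=lambda r: v[-1][r] * TRANS[r][k])
            pointers[k] = best
            column[k] = EMIT[k][symbol] * v[-1][best] * TRANS[best][k]
        v.append(column)
        backpointers.append(pointers)
    # Termination with a_k0 treated as 1 (assumption), then trace-back.
    last = max(STATES, key=lambda k: v[-1][k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), v[-1][last]

if __name__ == "__main__":
    x = [6, 2, 6]
    for candidate in ("FFF", "LLL", "LFL"):
        print(candidate, joint_probability(x, candidate))
    print(viterbi(x))
```

For x = 6, 2, 6 the trace-back recovers LLL, the same path that has the largest of the three hand-computed joint probabilities.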
Total Probability
Several different paths can result in observation x
Probability that our model will emit x is:
$\Pr(x) = \sum_{\pi} \Pr(x, \pi)$   (Total Probability)

Total Probability: Example
π \ x | ε | 6 | 2 | 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)×(1/2) = 1/12 | (1/6)×sum{(1/12)×0.99, (1/4)×0.2} = 0.022083 | (1/6)×sum{0.022083×0.99, 0.020083×0.2} = 0.004313
L | 0 | (1/2)×(1/2) = 1/4 | (1/10)×sum{(1/12)×0.01, (1/4)×0.8} = 0.020083 | (1/2)×sum{0.022083×0.01, 0.020083×0.8} = 0.008144
Total probability = $\sum_{\pi} \Pr(x, \pi)$ = 0.004313 + 0.008144 = 0.012
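The total-probability table replaces the max of the Viterbi recursion with a sum (the "forward" algorithm). The sketch below recomputes the example and cross-checks it against brute-force enumeration of all paths. As before, the 0.5 start probabilities come from the worked example, the end transition is treated as 1, and the function names `forward` and `brute_force_total` are assumptions.

```python
from itertools import product

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {**{face: 1 / 10 for face in range(1, 6)}, 6: 1 / 2},
}

def forward(x):
    """Pr(x): sum over all state paths, computed column by column."""
    f = {k: START[k] * EMIT[k][x[0]] for k in STATES}
    for symbol in x[1:]:
        f = {
            k: EMIT[k][symbol] * sum(f[r] * TRANS[r][k] for r in STATES)
            for k in STATES
        }
    return sum(f.values())

def brute_force_total(x):
    """Same quantity by explicit enumeration of every path (exponential in len(x))."""
    total = 0.0
    for path in product(STATES, repeat=len(x)):
        p = START[path[0]] * EMIT[path[0]][x[0]]
        for i in range(1, len(x)):
            p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][x[i]]
        total += p
    return total

if __name__ == "__main__":
    x = [6, 2, 6]
    print(forward(x), brute_force_total(x))
```

Both functions give about 0.0125 for x = 6, 2, 6, matching the table's 0.004313 + 0.008144; the dynamic-programming version scales linearly with sequence length, while enumeration does not.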
An HMM for CpG Islands?
Viterbi gets it right more often than not
Emission probabilities are 0 or 1 e.g., eG-(G) = 1, eG-(T) = 0
See Durbin et al., Biological Sequence Analysis, Cambridge, 1998
Estimating the Probabilities
or “Training” the HMM
• Viterbi training
  • Derive probable paths for training data using Viterbi algorithm
  • Re-estimate transition probabilities based on Viterbi path
  • Iterate until paths stop changing
• Other algorithms can be used
  • e.g., "forward" algorithm
  • (see text - or see Wikipedia re: HMMs)

Profile HMMs
• Used to model a family of related sequences
(or motif or domain)
• Derived from a MSA of family members
• Transition & emission probabilities are position-specific
• Set parameters of model so that total probability peaks
at members of family
• Sequences can be tested for family membership using
Viterbi algorithm to evaluate match against profile
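The re-estimation step of Viterbi training listed above amounts to counting transitions along the decoded paths and normalising. The sketch below shows only that single step, not the full iterate-until-stable loop. The function name `reestimate_transitions`, the two-state F/L alphabet, and the pseudocount (which avoids zero probabilities when a transition is never seen) are assumptions.

```python
from collections import Counter

def reestimate_transitions(paths, states=("F", "L"), pseudocount=1.0):
    """Re-estimate transition probabilities from state paths (e.g. Viterbi paths)."""
    counts = Counter()
    for path in paths:
        # Count each adjacent pair of states (k -> r) along the path.
        counts.update(zip(path, path[1:]))
    trans = {}
    for k in states:
        row_total = sum(counts[(k, r)] + pseudocount for r in states)
        trans[k] = {r: (counts[(k, r)] + pseudocount) / row_total for r in states}
    return trans

if __name__ == "__main__":
    decoded = ["FFFFFLLLLLLLFFFF"]
    print(reestimate_transitions(decoded))
```

In a full training loop, these new probabilities would be fed back into the Viterbi decoder and the two steps repeated until the decoded paths stop changing.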
An HMM can represent a MSA

Pfam: Protein Families
http://pfam.sanger.ac.uk/
• “A comprehensive collection of protein domains and families,
with a range of well-established uses including genome
annotation.”
• Pfam: clans, web tools and services: R.D. Finn, J. Mistry, B. Schuster-Böckler,
S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M.
Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L.L. Sonnhammer and A.
Bateman (2006) Nucleic Acids Res Database Issue 34:D247-D251
• Each family is represented by:
  • 2 MSAs
  • 2 Hidden Markov Models (profile-HMMs)
• cf. Superfamily - from Lab 5
  • similar collection of curated MSAs & HMMs, focuses on
  superfamily level
Chp 7 - Protein Motifs & Domain Prediction
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 7
Protein Motifs and Domain Prediction
• Identification of Motifs & Domains in MSAs
• Motif & Domain Databases Using Regular Expressions
• Motif & Domain Databases Using Statistical Models
• Protein Family Databases
• Motif Discovery in Unaligned Sequences
• √ Sequence Logos

Motifs & Domains
• Motif - short conserved sequence pattern
  • Associated with distinct function in protein or DNA
  • Avg = 10 residues (usually 6-20 residues)
  • e.g., zinc finger motif - in protein
  • e.g., TATA box - in DNA
• Domain - "longer" conserved sequence pattern, defined as an
independent functional and/or structural unit
  • Avg = 100 residues (range from 40-700 in proteins)
  • e.g., kinase domain or transmembrane domain - in protein
• Domains may (or may not) include motifs