Lecture 16 #16_Sept28 Profiles & Hidden Markov Models (HMMs)

BCB 444/544
Lecture 16
Profiles &
Hidden Markov Models (HMMs)
#16_Sept28
BCB 444/544 F07 ISU Dobbs #16 - Profiles & HMMs
9/28/07
1
Required Reading
(before lecture)
√ Mon & Wed Sept 24 & 26- Lecture 14 & 15
Review: Nucleus, Chromosomes, Genes, RNAs, Proteins
Surprise lecture: No assigned reading
√ Fri Sept 28 - Lecture 16
Profiles & Hidden Markov Models
• Chp 6 - pp 79-84
• Eddy: What is a hidden Markov Model?
2004 Nature Biotechnol 22:1315
http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Thurs Sept 27 - Lab 4 & Mon Oct 1 - Lecture 17
Protein Families, Domains, and Motifs
• Chp 7 - pp 85-96
Assignments & Announcements
Fri Sept 28
• Exam 1 - Graded & returned in class - Really!
• HW#2 - Graded & returned in class - Really!
• Answer KEYs posted on website
• Grades posted on WebCT
• HomeWork #3 - posted online
Due: Mon Oct 8 by 5 PM
• HW544Extra #1 - posted online
Due: Task 1.1 - Mon Oct 1 by noon
Task 1.2 & Task 2 - Mon Oct 8 by 5 PM
BCB 544 - Extra Required Reading
Mon Sept 24
BCB 544 Extra Required Reading Assignment:
• Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS,
Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M Jr,
Vanderhaeghen P, Haussler D. (2006) An RNA gene expressed during cortical
development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html
doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link
Extra Credit Questions #2-6:
2. What is the size of the dystrophin gene (in kb)?
Is it still the largest known human protein?
3. What is the largest protein encoded in human genome (i.e.,
longest single polypeptide chain)?
4. What is the largest protein complex for which a structure is
known (for any organism)?
5. What is the most abundant protein (naturally occurring) on
earth?
6. Which state in the US has the largest number of mobile
genetic elements (transposons) in its living population?
For 1 pt total (0.2 pt each): Answer all questions correctly
& submit to terrible@iastate.edu
For 2 pts total: Prepare a PPT slide with all correct answers
& submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Choose one option - you can't earn 3 pts!
• Partial credit for incorrect answers? only if they are truly amusing!
Extra Credit Questions #7 & #8:
Given that each male attending our BCB 444/544 class on a typical
day is healthy (let's assume MH=7), and is generating sperm at a
rate equal to the average normal rate for reproductively
competent males (dSp/dT = ? per minute):
7a. How many rounds of meiosis will occur during our 50 minute class
period?
7b. How many total sperm will be produced by our BCB 444/544 class
during that class period?
8. How many rounds of meiosis will occur in the reproductively
competent females in our class? (assume FH=5)
For 0.6 pts total (0.2 pt each): Answer all questions correctly
& submit to terrible@iastate.edu
For 1 pt total: Prepare a PPT slide with all correct answers
& submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Choose one option - you can't earn more than 1 pt for this!
• Partial credit for incorrect answers? only if they are truly amusing!
Information flow in the cell?
• DNA -> RNA -> protein:
• Replication = DNA to DNA - by DNA polymerase
• Transcription = DNA to RNA - by RNA polymerase
• Translation = RNA to protein - by ribosomes
• Exceptions/Complications:
• DNA rearrangements: (by mobile genetic elements, recombination)
• Reverse transcription: (RNA -> DNA, by reverse transcriptase)
• Post-transcriptional modifications:
• RNA splicing (removal of introns, by spliceosome)
• RNA editing (addition/removal of nucleotides - usually U's)
• Post-translational modifications:
• Protein processing
Modeling Metabolic Pathways? see MetNet
http://metnet.vrac.iastate.edu/MetNet_overview.htm
Chromosomes & Genes
Genes in chromatin are not just “beads on a string”
they are packaged in complex structures that we
don't yet fully understand
Gene regulation
• Transcriptional regulation is primarily mediated by
proteins that bind cis-acting elements or DNA sequence
signals associated with genes:
• DNA level (sequence-specific) regulatory signals
• Promoters, terminators
• Enhancers, repressors, silencers
• Chromatin level (global) regulation
• Heterochromatin (inactive)
• e.g., X-inactivation in female mammals
• In eukaryotes, genes are often regulated at other levels:
• Post-transcriptional (RNA transport, splicing, stability)
• Post-translational (protein localization, folding, stability)
Promoter = DNA sequences required for
initiation of transcription; contain TF binding
sites, usually "close" to start site
• Transcription factors (TFs) - proteins that regulate transcription
• (In eukaryotes) RNA polymerase binds by recognizing a complex of
TFs bound at the promoter
• First, TFs must bind TF binding sites (TFBSs) within promoters; then
RNA polymerase can bind and initiate transcription of RNA
[Figure: promoter region (~200 bp) with bound TFs and RNA polymerase; transcript labeled Pre-mRNA]
Enhancers & repressors = DNA sequences
that regulate initiation of transcription;
contain TF binding sites, can be far from start site!
• RNAP = RNA polymerase II
• Enhancers "enhance" transcription
• Enhancer binding proteins (TFs) interact with RNAP
• Repressors or silencers "repress" transcription
• Repressor binding proteins (TFs) block transcription
[Figure: gene with promoter, enhancer, and repressor elements; distance scale 10-50,000 bp]
Transcription factors (TFs) &
their binding sites (TFBSs)
• Transcription factors - trans-acting factors - proteins
that either activate or repress transcription, usually by
binding DNA (via a DNA binding domain) & interacting
with RNA polymerase (via a "trans-activating" domain) to
affect rate of transcription initiation
• Promoters, enhancers, and repressors - all contain binding sites
for transcription factors
• Promoters - usually located close to start site;
vs
• Enhancers/Silencers/Repressor sequences - can be close or very
far away: located upstream, downstream or even within the coding
sequence of genes !!
"Non-coding" DNA? Many genes encode
RNA that is not translated
4 Major Classes of RNA:
1. mRNA = messenger RNA
2. tRNA = transfer RNA
3. rRNA = ribosomal RNA
4. "Other" - Lots of these, diverse structures & functions:
• "Natural" RNAs:
• siRNA, miRNA, piRNA, snRNA, snoRNA, …
• ribozymes
• Artificial RNAs:
• RNAi
• antisense RNA
RNA Sequence, Structure & Function
• RNAs can have complex 3D structures (like proteins) &
have many important functions in cellular processes
• Ribosomes contain RNAs & proteins
• Ribozymes are RNA enzymes capable of RNA cleavage
• RNA molecules are believed to be precursors to DNA-based life
• Form complementary base pairs and replicate (like DNA)
• Perform enzymatic functions (like proteins)
Protein Sequence, Structure & Function
• Amino acid sequence determines protein structure
• But some proteins need help folding ("chaperones") in vivo
• Protein structure determines function
• But level, timing & location of expression are important
• Interactions with other proteins, DNA, RNA, & small
ligands are also very important!!
• We don't know the "folding code" that determines how
proteins fold!
• We don't know the "recognition code" that determines
how proteins find and interact with correct partners!
A few Online Resources for:
Cell & Molecular Biology
• NCBI Science Primer: What is a cell?
• NCBI Science Primer: What is a genome?
• BioTech’s Life Science Dictionary
• NCBI bookshelf
Chp 6 - Profiles & Hidden Markov Models
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 6
Profiles & HMMs
• √Position Specific Scoring Matrices (PSSMs)
• √PSI-BLAST
TODAY:
• Profiles
• Markov Models & Hidden Markov Models
Algorithms & Software for MSA? #3
(NOT covered on Exam1)
Heuristic Methods - continued
• Progressive alignments (Star Alignment, Clustal)
• Others: T-Coffee, DbClustal -see text: can be better than Clustal
• Match closely-related sequences first using a guide tree
• Partial order alignments (POA)
• Doesn't rely on guide tree; adds sequences in order given
• PRALINE
• Preprocesses input sequences by building profiles for each
• Iterative methods
• Idea: optimal solution can be found by repeatedly modifying existing
suboptimal solutions (eg: PRRN)
• Block-based Alignment
• Multiple re-building attempts to find best alignment
(eg: DIALIGN2 & Match-Box)
• Local alignments
• Profiles, Blocks, Patterns - more on these soon!
Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns:
• Regulatory motifs (TF binding sites)
• Splice sites
• Protein domains
• Identifying and characterizing protein families
• Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) &
mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)
Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
Rationale: if sequences are homologous (derived from a common ancestor),
they may be structurally/functionally equivalent
TATA box = transcriptional
promoter element
Sequence Logo
Patterns can also be represented as
Sequence Logos
Sequence Logo
Sequence Logos: for Promoter
elements (TF Binding Sites)
• Example was created from a set of TATA binding sites from
TRANSFAC database.
• http://www.gene-regulation.com/pub/databases.html
• Logo was created by WebLogo.
• http://weblogo.berkeley.edu/logo.cgi
• Can see TATA-box quite easily.
Sequence Logos - for RNA Splicing Sites
Human intron donor
and acceptor sites
http://www-lmmb.ncifcrf.gov/~toms/gallery/SequenceLogoSculpture.gif
PSSM vs Profile
Position-Specific Scoring Matrix (PSSM): from ungapped MSA
Profile: from MSA, including gaps
PSI-BLAST Pseudocode
Convert query to PSSM (or a Profile)
do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM
}
Print current set of homologs
Note: Xiong textbook distinguishes between PSSMs (which
have no gaps) & Profiles (can include gaps).
Thus, based on these definitions, PSI-BLAST uses a Profile
to iteratively add new homologs - other authors refer to
pattern used by PSI-BLAST as a PSSM.
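The iterative loop in the pseudocode can be sketched in Python. This is only an illustration of the control flow: the `search` callback is a hypothetical stand-in for "build a PSSM from the current homologs and BLAST the database with it", and the toy "database" below is invented for the example. It is not how PSI-BLAST itself is implemented.

```python
from typing import Callable, Iterable, Set


def iterative_search(
    query: str,
    search: Callable[[Set[str]], Iterable[str]],
    max_rounds: int = 5,
) -> Set[str]:
    """Repeat a profile search until no new homologs appear."""
    homologs = {query}
    for _ in range(max_rounds):
        hits = set(search(homologs))   # "BLAST database with PSSM"
        new = hits - homologs
        if not new:                    # stop if no new homologs are found
            break
        homologs |= new                # add new homologs to the model
    return homologs


# Hypothetical toy database: each sequence name "finds" its neighbors.
TOY_NEIGHBORS = {"q": {"a"}, "a": {"b"}, "b": set(), "z": {"q"}}


def toy_search(current: Set[str]) -> Set[str]:
    found: Set[str] = set()
    for name in current:
        found |= TOY_NEIGHBORS.get(name, set())
    return found


result = iterative_search("q", toy_search)
print(sorted(result))
```

The `max_rounds` cap is an added safeguard for the sketch; the pseudocode on the slide stops only when no new homologs are found.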
What is a PSSM?
Position-Specific Scoring Matrix
A PSSM is:
• a representation of a motif
• an n by m matrix, where n is
size of alphabet & m is length of
sequence
• a matrix of scores in which
entry at (i, j) is score assigned
by PSSM to letter i at the jth
position
Xiong: PSSM = table that contains
probability information re: residues at
each position of an ungapped MSA
Also, sometimes called:
Position Weight Matrix (PWM)
PSSM for an 8-residue sequence (rows = residues, columns = positions 1-8):

     1   2   3   4   5   6   7   8
A   -1  -2  -1   0  -1  -2   0  -2
R    5   0   5  -2   1  -3  -2   0
N    0   6   0   0   0  -3   0   1
D   -2   1  -2  -1   0  -3  -1  -1
C   -3  -3  -3  -3  -3  -2  -3  -3
Q    1   0   1  -2   5  -3  -2   0
E    0   0   0  -2   2  -3  -2   0
G   -2   0  -2   6  -2  -3   6  -2
H    0   1   0  -2   0  -1  -2   8
I   -3  -3  -3  -4  -3   0  -4  -3
L   -2  -3  -2  -4  -2   0  -4  -3
K    2   0   2  -2   1  -3  -2  -1
M   -1  -2  -1  -3   0   0  -3  -2
F   -3  -3  -3  -3  -3   6  -3  -1
P   -2  -2  -2  -2  -1  -4  -2  -2
S   -1   1  -1   0   0  -2   0  -1
T   -1   0  -1  -2  -1  -2  -2  -2
W   -3  -4  -3  -2  -2   1  -2  -2
Y   -2  -2  -2  -3  -1   3  -3   2
V   -3  -3  -3  -3  -2  -1  -3  -3

Example: “K” at position 3 gets a score of 2
Note: Assumes positions are independent
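Because positions are treated as independent, scoring a sequence against a PSSM is just a sum of per-position lookups. The sketch below assumes a dictionary layout (residue -> list of 8 scores) chosen for illustration and includes only a subset of the rows from the table; the function name `pssm_score` is hypothetical.

```python
# Subset of the PSSM rows from the slide (scores for positions 1-8).
PSSM = {
    "R": [5, 0, 5, -2, 1, -3, -2, 0],
    "N": [0, 6, 0, 0, 0, -3, 0, 1],
    "G": [-2, 0, -2, 6, -2, -3, 6, -2],
    "Q": [1, 0, 1, -2, 5, -3, -2, 0],
    "F": [-3, -3, -3, -3, -3, 6, -3, -1],
    "Y": [-2, -2, -2, -3, -1, 3, -3, 2],
    "K": [2, 0, 2, -2, 1, -3, -2, -1],
}


def pssm_score(seq, pssm):
    """Sum the score of each residue at its position (positions independent)."""
    width = len(next(iter(pssm.values())))
    if len(seq) != width:
        raise ValueError("sequence length must equal PSSM width")
    return sum(pssm[res][pos] for pos, res in enumerate(seq))


k_at_3 = PSSM["K"][2]                      # "K" at position 3
score = pssm_score("RNRGQFGY", PSSM)       # 5 + 6 + 5 + 6 + 5 + 6 + 6 + 2
variant = pssm_score("RNKGQFGY", PSSM)     # R -> K at position 3 lowers the score
print(k_at_3, score, variant)
```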
PSSM Entries = Log-Odds Scores
Observed frequency
of residue “A”
1. Estimate probability of observing
each residue (probability of A given
M, where M is PSSM model)
2. Divide by background probability of
observing each residue (probability
of A given B, where B is background
model)
3. Take log so that can add (rather than
multiply) scores
Log-odds score for residue A:
$$ \log_2 \left( \frac{\Pr(A \mid M)}{\Pr(A \mid B)} \right) $$
Numerator: foreground model (i.e., the PSSM). Denominator: background model.
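The three steps can be tried with a few numbers. The probabilities below are hypothetical values picked so the arithmetic is easy to follow; they are not taken from any real alignment, and `log_odds` is an illustrative helper name.

```python
import math


def log_odds(p_model, p_background):
    """log2 of (probability under the PSSM model / background probability)."""
    return math.log2(p_model / p_background)


# Hypothetical column: residue seen 8x more often than background.
enriched = log_odds(0.40, 0.05)      # log2(8), about 3 bits
# Hypothetical column: residue seen 4x less often than background.
depleted = log_odds(0.0125, 0.05)    # log2(1/4), about -2 bits
# Taking logs lets us add scores where probabilities would be multiplied.
total = enriched + depleted
print(enriched, depleted, total)
```

In practice a pseudocount is usually added to the observed frequencies so that an unobserved residue does not give log(0); that refinement is left out of this sketch.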
Statistics References
Statistical Inference (Hardcover)
George Casella, Roger L. Berger
StatWeb:
A Guide to Basic Statistics for Biologists
http://www.dur.ac.uk/stat.web/
Basic Statistics:
http://www.statsoft.com/textbook/stbasic.html
(correlations, tests, frequencies, etc.)
Electronic Statistics Textbook: StatSoft
http://www.statsoft.com/textbook/stathome.html
(from basic statistics to ANOVA to discriminant analysis, clustering, regression
data mining, machine learning, etc.)
Sequence Profiles
Goal: to characterize sequences belonging to a class
(structural or functional) & determine whether a
query sequence also belongs to that class
• DNA or RNA sequences
• Protein sequences
• Idea is to provide a "model" of the class against
which we can test the new sequence
Protein Sequence Profiles & PSSMs
• Profile - a table that lists frequencies of each amino acid
in each position of a protein sequence
• PSSM - a special type of Profile - with no gaps
• Frequencies are calculated from a MSA containing a domain
of interest
• Can be used to generate a consensus sequence
• Derived scoring scheme can be used to align a new sequence
to the profile
• Profile can be used in database searches (PSI-BLAST)
to find new sequences that match the profile
• Profiles can also be used to compute MSAs heuristically (e.g.,
progressive alignment)
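A minimal sketch of how a frequency profile and a consensus sequence can be derived from an alignment. The four-sequence ungapped alignment is a made-up toy example, and the function names are hypothetical.

```python
from collections import Counter


def column_frequencies(msa):
    """For each alignment column, return a dict residue -> frequency."""
    n = len(msa)
    return [
        {res: count / n for res, count in Counter(col).items()}
        for col in zip(*msa)
    ]


def consensus(msa):
    """Most frequent residue in each column (ties broken arbitrarily)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*msa))


# Hypothetical toy alignment (ungapped, so this is a PSSM-style profile).
MSA = ["TATAAT", "TATGAT", "TACAAT", "TATAAA"]
profile = column_frequencies(MSA)
cons = consensus(MSA)
print(cons)
print(profile[2])
```

Turning these frequencies into scores would use the log-odds step described earlier; handling gap columns is what distinguishes a full profile from this simplified case.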
PSI-BLAST Limitations for generating
patterns or "motifs"
• With PSSMs, can't have insertions and deletions
• With Profiles, essentially 'add extra columns' to
PSSM to allow for gaps
• Better approach (for defining domains)?
• Profile HMM: elaborated version of a profile
• Intuitively, a profile that models gaps
Sequence Motifs (Patterns)
Types of representations?
• √ Consensus Sequence
• √ Sequence Logo - "enhanced" consensus sequence,
in which symbol size reflects information entropy
• Information entropy??? In information theory, the Shannon
entropy or information entropy is a measure of the [decrease in]
uncertainty associated with a random variable. Entropy quantifies
information in a piece of data.
- Wikipedia
• Check out this interesting website:
Tom Schneider, NCIF
• http://www.ccrnp.ncifcrf.gov/~toms/glossary.html#sequence_logo
• √PSSM - Position-Specific Scoring Matrix
• √Profiles
HMMs - Hidden Markov Models
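To make the entropy idea behind sequence logos concrete, here is a small sketch that computes the information content of one alignment column as log2(alphabet size) minus the Shannon entropy. It assumes a DNA alphabet of size 4 and omits the small-sample correction that logo software typically applies; `column_information` is an illustrative name.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet size) - Shannon entropy."""
    n = len(column)
    entropy = -sum(
        (count / n) * math.log2(count / n) for count in Counter(column).values()
    )
    return math.log2(alphabet_size) - entropy


conserved = column_information("AAAA")   # fully conserved DNA column
half = column_information("AATT")        # two bases, equally frequent
mixed = column_information("ACGT")       # all four bases equally frequent
print(conserved, half, mixed)
```

A fully conserved column carries the maximum of 2 bits and a uniform column carries none, which is why conserved positions stand tall in a logo.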
HMMs: an example
Nucleotide frequencies in human genome
A, T: ~29.5-29.6% each
C, G: ~20.4-20.5% each
CpG Islands
(Written CpG to distinguish it
from a C≡G base pair)
• CpG dinucleotides are rarer than would be expected
from independent probabilities of C and G (given the
background frequencies in human genome)
• High CpG frequency is sometimes biologically
significant; e.g., sometimes associated with promoter
regions (“start sites” for genes)
• CpG island - a region where CpG dinucleotides are
much more abundant than elsewhere
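One common way to quantify "rarer than expected" is the observed/expected CpG ratio: the number of CG dinucleotides divided by the number expected if C and G occurred independently, approximated as (count of C × count of G) / length. The sketch below assumes that approximation; the function name and test sequences are invented for illustration.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")          # "CG" cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0                 # ratio undefined; report 0 for convenience
    return cpg * n / (c * g)


island_like = cpg_obs_exp("CGCGCGCG")   # CpG-rich toy sequence
depleted = cpg_obs_exp("CATGCATG")      # contains C and G but no CpG
print(island_like, depleted)
```

Real CpG-island finders apply such a ratio in sliding windows together with a GC-content threshold; the toy sequences here are far too short for that.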
Hidden Markov Models - HMMs
Goal: Find most likely explanation for observed variables
Components:
• Observed variables
• Hidden variables
• Emitted symbols
• Emission probabilities
• Transition probabilities
• Graphical representation to illustrate relationships among these
The Occasionally Dishonest Casino
A casino uses a fair die most of the time, but occasionally
switches to a "loaded" one
• Fair die:
Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
• Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = ½
• These are emission probabilities
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
• Transitions between states obey a Markov process
• (more on Markov chains/models/processes a bit later)
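The casino can be simulated directly from these numbers. In this sketch the "stay" probabilities (0.99 and 0.8) are implied by the switching probabilities above, while starting with the fair die and the fixed random seed are assumptions made for the example.

```python
import random

# Transition probabilities: switching values from the slide, staying = 1 - switching.
TRANSITIONS = {"F": {"F": 0.99, "L": 0.01}, "L": {"L": 0.8, "F": 0.2}}
# Emission probabilities for the fair (F) and loaded (L) die.
EMISSIONS = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def simulate(n_rolls, start="F", seed=0):
    """Return (hidden state string, observed roll string) of length n_rolls."""
    rng = random.Random(seed)
    state, states, rolls = start, [], []
    for _ in range(n_rolls):
        states.append(state)
        faces = list(EMISSIONS[state])
        weights = list(EMISSIONS[state].values())
        rolls.append(rng.choices(faces, weights=weights)[0])   # emit a symbol
        nxt = list(TRANSITIONS[state])
        nxt_weights = list(TRANSITIONS[state].values())
        state = rng.choices(nxt, weights=nxt_weights)[0]       # Markov step
    return "".join(states), "".join(str(r) for r in rolls)


hidden, observed = simulate(300)
print(hidden[:60])
print(observed[:60])
```

Only `observed` would be visible to a player; `hidden` is what an HMM decoder tries to recover.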
An HMM for Occasionally Dishonest Casino
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
Emission probabilities
• Fair die:
Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
• Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = ½
The Occasionally Dishonest Casino
• Known:
• Structure of the model
• Transition probabilities
• Hidden: What casino actually did
• FFFFFLLLLLLLFFFF...
• Observable: Series of die tosses
• 3415256664666153...
• What we must infer:
• When was a fair die used?
• When was a loaded one used?
• Answer is a sequence
FFFFFFFLLLLLLFFF...
HMM: Making the Inference
• Model assigns a probability to each explanation for the
observation, e.g.:
P(326|FFL)
= P(3|F) · P(F→F) · P(2|F) · P(F→L) · P(6|L)
= 1/6 · 0.99 · 1/6 · 0.01 · ½
• Maximum Likelihood: Determine which explanation is most likely
• Find path most likely to have produced observed sequence
• Total Probability: Determine probability that observed sequence
was produced by HMM
• Consider all paths that could have produced the observed
sequence
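For a very short observation, both questions can be answered by brute force: enumerate every path, take the maximum (most likely explanation) and the sum (total probability). The sketch below uses the casino parameters and assumes equal start probabilities of 0.5 for each die; it is only feasible for tiny examples, since the number of paths doubles with every roll.

```python
from itertools import product

START = {"F": 0.5, "L": 0.5}   # assumed: either die equally likely at the start
TRANS = {("F", "F"): 0.99, ("F", "L"): 0.01, ("L", "L"): 0.8, ("L", "F"): 0.2}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}


def joint_probability(rolls, path):
    """Probability of seeing `rolls` while following the state `path`."""
    p = START[path[0]] * EMIT[path[0]][rolls[0]]
    for i in range(1, len(rolls)):
        p *= TRANS[(path[i - 1], path[i])] * EMIT[path[i]][rolls[i]]
    return p


rolls = "626"
scores = {
    "".join(path): joint_probability(rolls, path)
    for path in product("FL", repeat=len(rolls))
}
best_path = max(scores, key=scores.get)   # maximum likelihood explanation
total = sum(scores.values())              # total probability over all paths
print(best_path, scores[best_path], total)
```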
HMM Notation
• x = sequence of symbols emitted by model
• x_i = symbol emitted at time i
• π = path, a sequence of states
• π_i = i-th state in π
• a_kr = probability of making a transition from state k to
state r:
$$ a_{kr} = \Pr(\pi_i = r \mid \pi_{i-1} = k) $$
• e_k(b) = probability that symbol b is emitted when in state k:
$$ e_k(b) = \Pr(x_i = b \mid \pi_i = k) $$
Calculating Different Paths to an
Observed Sequence
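$$ x = x_1, x_2, x_3 = 6, 2, 6 $$

Path 1: $\pi^{(1)} = FFF$
$$ \Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227 $$

Path 2: $\pi^{(2)} = LLL$
$$ \Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008 $$

Path 3: $\pi^{(3)} = LFL$
$$ \Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417 $$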
Identifying the Most Probable Path
The most likely path π* satisfies:
$$ \pi^* = \arg\max_{\pi} \Pr(x, \pi) $$
To find π*, consider all possible ways the last "symbol"
of x could have been emitted
Let
$$ v_k(i) = \text{probability of the path } \pi_1, \ldots, \pi_i \text{ most likely to emit } x_1, \ldots, x_i \text{ such that } \pi_i = k $$
Then
$$ v_k(i) = e_k(x_i) \max_r \{ v_r(i-1)\, a_{rk} \} $$
Viterbi Algorithm
• Initialization (i = 0):
$$ v_0(0) = 1, \quad v_k(0) = 0 \text{ for } k > 0 $$
• Recursion (i = 1, . . . , L): For each state k
$$ v_k(i) = e_k(x_i) \max_r \{ v_r(i-1)\, a_{rk} \} $$
• Termination:
$$ \Pr(x, \pi^*) = \max_k \{ v_k(L)\, a_{k0} \} $$
To find π*, use trace-back, as in dynamic programming
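The recursion and trace-back can be sketched for the two-state casino model. This version assumes start probabilities of 0.5 for each die and, for simplicity, omits the end-state transition a_k0 in the termination step; it also multiplies raw probabilities, whereas longer sequences would normally use log probabilities to avoid underflow.

```python
STATES = "FL"
START = {"F": 0.5, "L": 0.5}   # assumed start probabilities (a_0F, a_0L)
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}


def viterbi(x):
    """Return (most probable state path, its joint probability with x)."""
    v = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]   # first column
    back = []                                             # trace-back pointers
    for sym in x[1:]:
        col, ptr = {}, {}
        for k in STATES:
            # Best previous state r for ending in state k at this position.
            r_best = max(STATES, key=lambda r: v[-1][r] * TRANS[r][k])
            col[k] = EMIT[k][sym] * v[-1][r_best] * TRANS[r_best][k]
            ptr[k] = r_best
        v.append(col)
        back.append(ptr)
    last = max(STATES, key=lambda k: v[-1][k])
    path = [last]
    for ptr in reversed(back):                            # follow pointers backward
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[-1][last]


path, prob = viterbi("626")
print(path, prob)
```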
Viterbi: Example
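Viterbi values v_k(i) for x = 6, 2, 6 (states: B = begin, F = fair, L = loaded); columns are start (no symbol emitted), 6, 2, 6:
B: 1 | 0 | 0 | 0
F: 0 | (1/6)(1/2) = 1/12 | (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375 | (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
L: 0 | (1/2)(1/2) = 1/4 | (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02 | (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008
Recursion used:
$$ v_k(i) = e_k(x_i) \max_r \{ v_r(i-1)\, a_{rk} \} $$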
Viterbi gets it right more often than not
An HMM for CpG islands
Emission probabilities are 0 or 1,
e.g., e_{G-}(G) = 1, e_{G-}(T) = 0
See Durbin et al., Biological Sequence Analysis, Cambridge 1998
Total Probability
Many different paths can result in
observation x
Probability that our model will emit x is the total probability:
$$ \Pr(x) = \sum_{\pi} \Pr(x, \pi) $$
Total Probability: Example
Forward values for x = 6, 2, 6 (states: B = begin, F = fair, L = loaded); columns are start (no symbol emitted), 6, 2, 6:
B: 1 | 0 | 0 | 0
F: 0 | (1/6)(1/2) = 1/12 | (1/6) sum{(1/12)(0.99), (1/4)(0.2)} = 0.022083 | (1/6) sum{(0.022083)(0.99), (0.020083)(0.2)} = 0.004313
L: 0 | (1/2)(1/2) = 1/4 | (1/10) sum{(1/12)(0.01), (1/4)(0.8)} = 0.020083 | (1/2) sum{(0.022083)(0.01), (0.020083)(0.8)} = 0.008144
Total probability = $\sum_{\pi} \Pr(x, \pi)$ = 0.004313 + 0.008144 = 0.012457
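The same table can be computed with the forward algorithm, which replaces the max in the Viterbi recursion with a sum. The sketch assumes start probabilities of 0.5 for each die and, like the worked example, adds up the final column without an explicit end-state transition.

```python
STATES = "FL"
START = {"F": 0.5, "L": 0.5}   # assumed start probabilities (a_0F, a_0L)
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}


def forward(x):
    """Total probability of x, summed over all state paths."""
    f = {k: START[k] * EMIT[k][x[0]] for k in STATES}     # first column
    for sym in x[1:]:
        f = {
            k: EMIT[k][sym] * sum(f[r] * TRANS[r][k] for r in STATES)
            for k in STATES
        }
    return sum(f.values())


total = forward("626")
print(round(total, 6))
```

Unlike brute-force enumeration, the work grows linearly with sequence length (times the square of the number of states).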
Estimating the probabilities (“training”)
• Baum-Welch algorithm
• Start with initial guess at transition probabilities
• Refine guess to improve the total probability of the training
data in each step
• May get stuck at local optimum
• Special case of expectation-maximization (EM) algorithm
• Viterbi training
• Derive probable paths for training data using Viterbi algorithm
• Re-estimate transition probabilities based on Viterbi path
• Iterate until paths stop changing
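The re-estimation step of Viterbi training amounts to counting: given a decoded state path, each transition probability is estimated as the fraction of times that transition was taken. The sketch below applies this to the casino's hidden-state example path; the function name and the optional pseudocount are illustrative additions, and a full training loop would alternate this step with Viterbi decoding.

```python
from collections import Counter


def estimate_transitions(path, pseudocount=0.0):
    """Estimate a_kr as (count of k->r) / (count of transitions leaving k)."""
    states = sorted(set(path))
    counts = Counter(zip(path, path[1:]))     # adjacent state pairs
    probs = {}
    for k in states:
        total = sum(counts[(k, r)] + pseudocount for r in states)
        if total > 0:
            probs[k] = {r: (counts[(k, r)] + pseudocount) / total for r in states}
    return probs


# Hidden-state path from the casino example (F = fair, L = loaded).
path = "FFFFFLLLLLLLFFFF"
probs = estimate_transitions(path)
print(probs["F"]["L"], probs["L"]["F"])
```

With so short a path the estimates are far from the true 0.01 and 0.2, which is a reminder that training needs much more data than one toy sequence.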
Profile HMMs
• Model a family of sequences
• Derived from a multiple alignment of the family
• Transition and emission probabilities are position-specific
• Set parameters of model so that total probability
peaks at members of family
• Sequences can be tested for membership in family
using Viterbi algorithm to match against profile
Profile HMMs
Pfam
• “A comprehensive collection of protein domains and families,
with a range of well-established uses including genome
annotation.”
• Each family is represented by two multiple sequence
alignments and two profile-Hidden Markov Models (profile-HMMs).
• A. Bateman et al. Nucleic Acids Research (2004) Database
Issue 32:D138-D141
Chp 7 - Protein Motifs & Domain Prediction
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 7
Protein Motifs and Domain Prediction
• Identification of Motifs & Domains in Multiple Sequence
Alignment
• Motif & Domain Databases Using Regular Expressions
• Motif & Domain Databases Using Statistical Models
• Protein Family Databases
• Motif Discovery in Unaligned Sequences
• Sequence Logos
HMM for Pairwise Alignment
How do we compute the best alignment?
The best alignment corresponds to the Viterbi path through the HMM.
Conclusion II
A multiple alignment is the “inverse” of a pairwise
alignment.
• Pairwise alignment:
similar sequences → evolutionary relation
• Multiple alignment:
evolutionary relation → similar sequence positions