Bioinformatics lectures at Rice University

Li Zhang
Lecture 7: HMM models and application in DNA
sequence analysis
http://odin.mdacc.tmc.edu/~llzhang/RiceCourse
A loaded die
Observed data:
213444244423563232546344443444235632321242454344426
Markov chain and Hidden Markov Model
• Markov chain:
A Markov chain is a sequence of random variables X1, X2, X3, ... with the
Markov property, namely that, given the present state, the future and
past states are independent.
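Written as a formula, the Markov property stated above says that conditioning on the whole history is the same as conditioning on the current state only. The lowercase x values are just notation for particular outcomes and are not taken from the slides.

```latex
\[
P(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \dots, X_n = x_n)
  = P(X_{n+1} = x \mid X_n = x_n)
\]
```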
• HMM (Hidden Markov Model):
Probabilistic parameters of a hidden Markov model (example)
x — states
y — possible observations
a — state transition probabilities
b — output probabilities
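With the symbols listed above, the joint probability of a hidden state path and an observation sequence of length T can be written as a product of transition and output probabilities. The initial state distribution, written here as \pi, is an assumed extra symbol that does not appear on the slide.

```latex
\[
P(x_1, \dots, x_T,\; y_1, \dots, y_T)
  = \pi_{x_1}\, b_{x_1}(y_1) \prod_{t=2}^{T} a_{x_{t-1} x_t}\, b_{x_t}(y_t)
\]
```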
A simple example of HMM
Consider two friends, Alice and Bob, who live far apart from each other and who talk
together daily over the telephone about what they did that day. Bob is only interested in
three activities: walking in the park, shopping, and cleaning his apartment. The choice of
what to do is determined exclusively by the weather on a given day. Alice has no definite
information about the weather where Bob lives, but she knows general trends. Based on
what Bob tells her he did each day, Alice tries to guess what the weather must have been
like.
Alice believes that the weather operates as a discrete Markov chain. There are two states,
"Rainy" and "Sunny", but she cannot observe them directly, that is, they are hidden from
her. On each day, there is a certain chance that Bob will perform one of the following
activities, depending on the weather: "walk", "shop", or "clean". Since Bob tells Alice
about his activities, those are the observations. The entire system is that of a hidden
Markov model (HMM).
Alice knows the general weather trends in the area, and what Bob likes to do on average.
In other words, the parameters of the HMM are known.
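Alice's guessing problem can be sketched in base R as a search for the most probable weather sequence (the Viterbi algorithm). The slides give no numbers for this example, so the start, transition, and emission probabilities below are assumed values chosen only for illustration, and viterbi_path is a hypothetical helper name, not a function from the HMM package.

```r
# Assumed, illustrative parameters (not given in the slides).
states  <- c("Rainy", "Sunny")
symbols <- c("walk", "shop", "clean")
start_p <- c(Rainy = 0.6, Sunny = 0.4)
trans_p <- matrix(c(0.7, 0.3,
                    0.4, 0.6),
                  nrow = 2, byrow = TRUE, dimnames = list(states, states))
emit_p  <- matrix(c(0.1, 0.4, 0.5,
                    0.6, 0.3, 0.1),
                  nrow = 2, byrow = TRUE, dimnames = list(states, symbols))

# Hypothetical helper: most probable hidden path, computed in log space.
viterbi_path <- function(obs, start_p, trans_p, emit_p) {
  states <- rownames(trans_p)
  n <- length(obs)
  k <- length(states)
  score <- matrix(-Inf, nrow = k, ncol = n)        # best log-probability so far
  back  <- matrix(NA_integer_, nrow = k, ncol = n) # best predecessor state
  score[, 1] <- log(start_p[states]) + log(emit_p[states, obs[1]])
  if (n > 1) for (t in 2:n) {
    for (j in seq_len(k)) {
      cand <- score[, t - 1] + log(trans_p[, j])
      back[j, t]  <- which.max(cand)
      score[j, t] <- max(cand) + log(emit_p[j, obs[t]])
    }
  }
  # Trace back from the best final state.
  path <- integer(n)
  path[n] <- which.max(score[, n])
  if (n > 1) for (t in n:2) path[t - 1] <- back[path[t], t]
  states[path]
}

bob_says <- c("walk", "shop", "clean")
guess <- viterbi_path(bob_says, start_p, trans_p, emit_p)
print(guess)
```

With these assumed numbers, three days of "walk", "shop", "clean" are best explained by the weather sequence Sunny, Rainy, Rainy.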
Load HMM package in R
# install the HMM package from CRAN (needed only once), then load it:
install.packages("HMM")
library("HMM")
Documentation:
Infer HMM parameters and hidden states from observations
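# Initialize HMM with hidden states A, B and observed symbols L, R
# (matrix() fills column by column, so each row of the matrices sums to 1)
hmm = initHMM(c("A","B"), c("L","R"),
              transProbs=matrix(c(.9,.1,.1,.9),2),
              emissionProbs=matrix(c(.5,.51,.5,.49),2))
print(hmm)
# Sequence of observations: 40 symbols that are mostly "R", then 40 that are mostly "L"
a = sample(c(rep("L",10),rep("R",30)))
b = sample(c(rep("L",30),rep("R",10)))
observation = c(a,b)
# take a look at the sequences:
paste(a, collapse='')
paste(b, collapse='')
# Baum-Welch: re-estimate the parameters, at most 10 iterations
bw = baumWelch(hmm,observation,10)
print(bw$hmm)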
#Observations:
"RLRRLRRLRRLRRRRRRRRRLLRRRLLRRRRLRRRRLRRR"
"LLLLLLLRRLLLRLLLLLLLRLRRRLLLLLLLRRLLLLLR"
…
$transProbs
    to
from           A         B
   A 0.972790696 0.0272093
   B 0.003085187 0.9969148
$emissionProbs
      symbols
states         L         R
     A 0.2585718 0.7414282
     B 0.7334568 0.2665432
paste(viterbi(hmm,observation), collapse='')
"BBBBBBBBBBBBBBBBBBBBAABBBAABBBBBBBBBB
BBBAAAAAAAAAAAAAAAAAAAAAABBBAAAAAAAAA
AAAAAA"
Generate simulated observation from a given HMM
Simulation
> simHMM(hmm, 20)
$states
[1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "A" "A" "A"
$observation
[1] "L" "L" "L" "L" "L" "L" "R" "L" "R" "L" "L" "L" "R" "R" "R" "L" "R" "L" "L" "L"
> hmm = initHMM(c("A","B"), c("L","R"), transProbs=matrix(c(.9,.1,.1,.9),2),
+ emissionProbs=matrix(c(.9,.1,.1,.9),2))
> simHMM(hmm, 20)
$states
[1] "B" "B" "B" "B" "B" "B" "A" "B" "B" "B" "B" "A" "A" "A" "A" "A" "A" "A" "B" "B"
$observation
[1] "R" "R" "R" "R" "R" "R" "R" "R" "R" "R" "R" "L" "L" "L" "L" "L" "L" "L" "L" "L"
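What such a simulation does can be sketched in base R: draw a first state, then alternate between emitting a symbol from the current state and moving to the next state. This is a minimal sketch and not the HMM package's own code; sim_hmm is a hypothetical helper name, and the uniform starting distribution is an assumption.

```r
# Hypothetical helper: simulate n steps from an HMM given as plain matrices.
sim_hmm <- function(n, states, symbols, start_p, trans_p, emit_p) {
  x <- integer(n)  # hidden state indices
  y <- integer(n)  # observed symbol indices
  x[1] <- sample(length(states), 1, prob = start_p)
  y[1] <- sample(length(symbols), 1, prob = emit_p[x[1], ])
  if (n > 1) for (t in 2:n) {
    x[t] <- sample(length(states), 1, prob = trans_p[x[t - 1], ])  # move
    y[t] <- sample(length(symbols), 1, prob = emit_p[x[t], ])      # emit
  }
  list(states = states[x], observation = symbols[y])
}

set.seed(1)  # only to make the run reproducible
states  <- c("A", "B")
symbols <- c("L", "R")
trans_p <- matrix(c(.9, .1, .1, .9), 2)  # same values as the second model above
emit_p  <- matrix(c(.9, .1, .1, .9), 2)
sim <- sim_hmm(20, states, symbols, start_p = c(.5, .5), trans_p, emit_p)
print(sim)
```

Because both matrices are strongly diagonal, the simulated observations tend to come in long runs that track the hidden states, as in the second simHMM output above.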
Homework
Dishonest casino demo
On transprob and emission
High entropy in emission → the observations are only a weak surrogate
marker of the hidden states, so the hidden states are hard to uncover.
High entropy in transprob → the states lack coherence, which also makes
the hidden states hard to uncover.
Given a large enough sample, the HMM parameters can be determined
accurately, but the hidden states still cannot be uncovered in high-entropy
cases.
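The notion of entropy used above can be made concrete by computing the Shannon entropy of each row of an emission matrix. The sketch below compares the two emission matrices used earlier in the lecture; entropy_bits is a hypothetical helper name, and measuring entropy in bits is a choice made here for illustration.

```r
# Hypothetical helper: Shannon entropy (in bits) of a probability vector.
entropy_bits <- function(p) {
  p <- p[p > 0]  # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

dn <- list(c("A", "B"), c("L", "R"))
emit_weak   <- matrix(c(.5, .51, .5, .49), 2, dimnames = dn)  # nearly uninformative
emit_strong <- matrix(c(.9, .1, .1, .9), 2, dimnames = dn)    # informative

h_weak   <- apply(emit_weak, 1, entropy_bits)    # one value per hidden state
h_strong <- apply(emit_strong, 1, entropy_bits)
print(round(h_weak, 3))
print(round(h_strong, 3))
```

Rows close to 1 bit (the maximum for two symbols) mean the symbol says almost nothing about the state; the 0.9/0.1 rows come out at roughly 0.47 bits. The same function can be applied to the rows of a transition matrix.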
HMM to identify genes in E. coli
• E. coli genome:
Observed data
The model
The model with long intergenic region
HMM models of exons and introns
The model
Nucleosome binding on DNA
HMM of nucleosome binding
• Genomic Sequence Is Highly Predictive of Local Nucleosome Depletion
• The regulation of DNA accessibility through nucleosome positioning is
important for transcription control. Computational models have been
developed to predict genome-wide nucleosome positions from DNA
sequences, but these models consider only nucleosome sequences, which
may have limited their power. We developed a statistical multi-resolution
approach to identify a sequence signature, called the N-score, that
distinguishes nucleosome binding DNA from non-nucleosome DNA. This
new approach has significantly improved the prediction accuracy. The
sequence information is highly predictive for local nucleosome enrichment
or depletion, whereas predictions of the exact positions are only modestly
more accurate than a null model, suggesting the importance of other
regulatory factors in fine-tuning the nucleosome positions. The N-score in
promoter regions is negatively correlated with gene expression levels.
Regulatory elements are enriched in low N-score regions. While our model
is derived from yeast data, the N-score pattern computed from this model
agrees well with recent high-resolution protein-binding data in human.
Comparing different models
GENSCAN model
• Genscan is a gene-prediction algorithm that,
like other hidden Markov models (HMMs),
models the transition probabilities from one
part (state) of a gene to another. Here, each
circle or square represents a functional unit
(a state) of a gene on its forward strand (for
example, Einit is the 5' coding sequence (CDS)
and Eterm is the 3' CDS), and the arrows
represent the transition probability from one
state to another. The Genscan algorithm is
trained by pre-computing the transition
probabilities from a set of known gene
structures. Test sequence data can then be
run one base position at a time, and the
model will predict the optimal state for that
position. The model for the reverse strand
(beneath the dashed line) is in mirror
symmetry to the model shown, with respect
to the horizontal axis. Please note that these
'UTRs' (untranslated regions) might contain
introns and so should not be confused with
the standard UTR. E, exon; I, intron; pro,
promoter.
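The idea of pre-computing transition probabilities from known gene structures can be illustrated with a toy count in base R. This is a simplified sketch and not Genscan's actual training procedure: the two annotated strings are made-up data, and only two states (E for exon, I for intron) are used, one label per base.

```r
# Hypothetical annotated state paths, one character per base position.
annotated   <- c("EEEEIIIIIEEE", "EEIIIIEEEEEE")
state_names <- c("E", "I")

# Count how often each state is followed by each other state.
counts <- matrix(0, 2, 2, dimnames = list(from = state_names, to = state_names))
for (s in annotated) {
  labs <- strsplit(s, "")[[1]]
  for (t in seq_len(length(labs) - 1)) {
    counts[labs[t], labs[t + 1]] <- counts[labs[t], labs[t + 1]] + 1
  }
}

# Normalise each row so that it sums to 1 (maximum-likelihood estimate).
trans_p <- counts / rowSums(counts)
print(counts)
print(round(trans_p, 3))
```

Each row of trans_p is the estimated distribution over the next state given the current one; with real annotations the same counting would be done over many more states and genes.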
Limitations of HMM
• Too many parameters
• First-order HMMs ignore dependencies between
hidden states that are more than one step apart
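How the two limitations trade off can be seen by counting free parameters. The sketch below assumes a discrete HMM in which every row of each probability table must sum to 1; hmm_free_params is a hypothetical helper, and counting the start distribution as n_states - 1 for every order is a simplification.

```r
# Hypothetical helper: number of free parameters of a discrete HMM.
# order = 1 is the usual HMM; order = k lets the next state depend on
# the previous k hidden states.
hmm_free_params <- function(n_states, n_symbols, order = 1) {
  start <- n_states - 1                       # initial distribution (simplified)
  trans <- n_states^order * (n_states - 1)    # one free row per state history
  emit  <- n_states * (n_symbols - 1)         # one free row per state
  start + trans + emit
}

# Two states and two symbols, as in the A/B, L/R example.
small <- hmm_free_params(2, 2)
# Ten states emitting the four DNA bases, first order vs. second order.
dna_1 <- hmm_free_params(10, 4, order = 1)
dna_2 <- hmm_free_params(10, 4, order = 2)
print(c(small = small, dna_1 = dna_1, dna_2 = dna_2))
```

Raising the order to capture longer-range dependencies multiplies the number of transition parameters, which is why the two limitations pull against each other.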