Bioinformatics lectures at Rice University
Li Zhang
Lecture 7: HMM models and application in DNA sequence analysis
http://odin.mdacc.tmc.edu/~llzhang/RiceCourse

A loaded die
Observed data: 213444244423563232546344443444235632321242454344426

Markov chain and Hidden Markov Model
• Markov chain: a Markov chain is a sequence of random variables X1, X2, X3, ... with the Markov property, namely that, given the present state, the future and past states are independent.
• HMM (Hidden Markov Model): probabilistic parameters of a hidden Markov model (example)
  x — states
  y — possible observations
  a — state transition probabilities
  b — output probabilities

A simple example of HMM
• Consider two friends, Alice and Bob, who live far apart from each other and who talk together daily over the telephone about what they did that day. Bob is only interested in three activities: walking in the park, shopping, and cleaning his apartment. The choice of what to do is determined exclusively by the weather on a given day. Alice has no definite information about the weather where Bob lives, but she knows general trends. Based on what Bob tells her he did each day, Alice tries to guess what the weather must have been like.
• Alice believes that the weather operates as a discrete Markov chain. There are two states, "Rainy" and "Sunny", but she cannot observe them directly; they are hidden from her. On each day, there is a certain chance that Bob will perform one of the following activities, depending on the weather: "walk", "shop", or "clean". Since Bob tells Alice about his activities, those are the observations. The entire system is that of a hidden Markov model (HMM).
• Alice knows the general weather trends in the area, and what Bob likes to do on average. In other words, the parameters of the HMM are known. (A small R sketch of this example appears after the homework slide below.)

Load HMM package in R
#load HMM package:
source("http://bioconductor.org/biocLite.R")
biocLite("HMM")
library("HMM")
Documentation:

Infer HMM parameters and hidden states from observations
# Initialize HMM
hmm = initHMM(c("A","B"), c("L","R"),
              transProbs=matrix(c(.9,.1,.1,.9),2),
              emissionProbs=matrix(c(.5,.51,.5,.49),2))
print(hmm)
# Sequence of observations
a = sample(c(rep("L",10), rep("R",30)))
b = sample(c(rep("L",30), rep("R",10)))
observation = c(a,b)
# Take a look at the sequences:
paste(a, collapse='')
paste(b, collapse='')
# Baum-Welch
bw = baumWelch(hmm, observation, 10)
print(bw$hmm)

# Observations:
"RLRRLRRLRRLRRRRRRRRRLLRRRLLRRRRLRRRRLRRR"
"LLLLLLLRRLLLRLLLLLLLRLRRRLLLLLLLRRLLLLLR"
...
$transProbs
    to
from           A         B
   A 0.972790696 0.0272093
   B 0.003085187 0.9969148

$emissionProbs
      symbols
states         L         R
     A 0.2585718 0.7414282
     B 0.7334568 0.2665432

paste(viterbi(hmm, observation), collapse='')
"BBBBBBBBBBBBBBBBBBBBAABBBAABBBBBBBBBBBBBAAAAAAAAAAAAAAAAAAAAAABBBAAAAAAAAAAAAAAA"

Generate simulated observation from a given HMM

Simulation
> simHMM(hmm, 20)
$states
 [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "A" "A" "A"
$observation
 [1] "L" "L" "L" "L" "L" "L" "R" "L" "R" "L" "L" "L" "R" "R" "R" "L" "R" "L" "L" "L"
> hmm = initHMM(c("A","B"), c("L","R"), transProbs=matrix(c(.9,.1,.1,.9),2),
+   emissionProbs=matrix(c(.9,.1,.1,.9),2))
> simHMM(hmm, 20)
$states
 [1] "B" "B" "B" "B" "B" "B" "A" "B" "B" "B" "B" "A" "A" "A" "A" "A" "A" "A" "B" "B"
$observation
 [1] "R" "R" "R" "R" "R" "R" "R" "R" "R" "R" "R" "L" "L" "L" "L" "L" "L" "L" "L" "L"

Homework
• Dishonest casino demo
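For the dishonest casino homework, a minimal sketch with the same HMM package is given below. The two dice, the switching probabilities, and the loaded-die emission probabilities (biased toward 4, echoing the loaded-die rolls at the start of the lecture) are illustrative assumptions, not values specified in the lecture.

# A minimal sketch of the dishonest casino HMM (all numbers are illustrative assumptions)
library("HMM")
casino = initHMM(c("Fair","Loaded"), as.character(1:6),
                 transProbs=matrix(c(0.95,0.05,
                                     0.10,0.90), 2, byrow=TRUE),
                 emissionProbs=matrix(c(rep(1/6,6),
                                        c(0.06,0.06,0.06,0.64,0.06,0.12)), 2, byrow=TRUE))
# Simulate 50 rolls, then try to recover which die was used with the Viterbi algorithm
sim = simHMM(casino, 50)
paste(sim$observation, collapse='')                   # observed rolls
paste(viterbi(casino, sim$observation), collapse=' ') # decoded hidden states
paste(sim$states, collapse=' ')                       # true hidden states, for comparison

With sticky transition probabilities and a strongly biased loaded die like these, the Viterbi path usually recovers the switches between the two dice reasonably well.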
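Going back to the Alice and Bob example, the same package can encode that model directly. The probabilities below are placeholders for illustration (the actual values were shown on the lecture figure and are not reproduced here); Alice's best guess of the weather sequence is simply the Viterbi path.

# The Alice and Bob weather HMM (probabilities are illustrative placeholders)
library("HMM")
weather = initHMM(c("Rainy","Sunny"), c("walk","shop","clean"),
                  startProbs=c(0.6,0.4),
                  transProbs=matrix(c(0.7,0.3,
                                      0.4,0.6), 2, byrow=TRUE),
                  emissionProbs=matrix(c(0.1,0.4,0.5,
                                         0.6,0.3,0.1), 2, byrow=TRUE))
# What Bob reported on four consecutive days
reported = c("walk","shop","clean","walk")
viterbi(weather, reported)   # Alice's best guess of the weather on those days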
On transition and emission probabilities
• High entropy in the emission probabilities: the observations are a weak surrogate marker of the hidden states, which makes the hidden states hard to uncover.
• High entropy in the transition probabilities: lack of coherence in the states, which also makes the hidden states hard to uncover.
• Given enough sample size, the HMM parameters can be determined accurately, but the hidden states cannot be uncovered in the high-entropy cases.

HMM to identify genes in E. coli
E. coli genome: observed data
The model
The model with long intergenic regions
(A toy two-state sketch of this idea appears at the end of these notes.)

HMM models of exons and introns
The model

Nucleosome binding on DNA

HMM of nucleosome binding
• Genomic Sequence Is Highly Predictive of Local Nucleosome Depletion
• The regulation of DNA accessibility through nucleosome positioning is important for transcription control. Computational models have been developed to predict genome-wide nucleosome positions from DNA sequences, but these models consider only nucleosome sequences, which may have limited their power. We developed a statistical multi-resolution approach to identify a sequence signature, called the N-score, that distinguishes nucleosome-binding DNA from non-nucleosome DNA. This new approach has significantly improved the prediction accuracy. The sequence information is highly predictive for local nucleosome enrichment or depletion, whereas predictions of the exact positions are only modestly more accurate than a null model, suggesting the importance of other regulatory factors in fine-tuning the nucleosome positions. The N-score in promoter regions is negatively correlated with gene expression levels. Regulatory elements are enriched in low N-score regions. While our model is derived from yeast data, the N-score pattern computed from this model agrees well with recent high-resolution protein-binding data in human.

Comparing different models

GENSCAN model
• Genscan is a gene-prediction algorithm that, like other hidden Markov models (HMMs), models the transition probabilities from one part (state) of a gene to another. Here, each circle or square represents a functional unit (a state) of a gene on its forward strand (for example, Einit is the 5' coding sequence (CDS) and Eterm is the 3' CDS), and the arrows represent the transition probabilities from one state to another. The Genscan algorithm is trained by pre-computing the transition probabilities from a set of known gene structures. Test sequence data can then be run one base position at a time, and the model will predict the optimal state for that position. The model for the reverse strand (beneath the dashed line) is in mirror symmetry to the model shown, with respect to the horizontal axis. Please note that these 'UTRs' (untranslated regions) might contain introns and so should not be confused with the standard UTR. E, exon; I, intron; pro, promoter.

Limitations of HMM
• Too many parameters
• First-order HMMs ignore dependencies between hidden states beyond the immediately preceding one
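To make the gene-finding idea above concrete, here is a toy two-state sketch (coding vs. intergenic) in the same HMM package. It is not the E. coli model or GENSCAN from the slides; the GC-rich coding assumption and all probabilities are made up for illustration.

# Toy two-state gene-finding sketch: coding vs. intergenic (all numbers are illustrative)
# Assumption for illustration only: coding regions are GC-rich, intergenic regions are AT-rich.
library("HMM")
gene = initHMM(c("coding","intergenic"), c("A","C","G","T"),
               transProbs=matrix(c(0.90,0.10,
                                   0.10,0.90), 2, byrow=TRUE),
               emissionProbs=matrix(c(0.15,0.35,0.35,0.15,
                                      0.35,0.15,0.15,0.35), 2, byrow=TRUE))
dna = strsplit("ATTATAGGCGCCGCGGCCATATATTA", "")[[1]]
viterbi(gene, dna)   # decoded state (coding/intergenic) at each position

A real gene model adds many more states (start and stop codons, codon position, and, in eukaryotes, splice sites), which is what the GENSCAN state diagram above encodes.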