DNA Analysis Part II
Amir Golnabi
ENGS 112, Spring 2008

What we saw in Part I:
1. Markov Chain
2. DNA and Modeling
3. Markovian Models for DNA Sequences
4. HMM for DNA Sequences

Part II:
1. DNA Methylation and CpG islands
2. Markov Chain Model
3. Hidden Markov Model
4. Finding the State Path
5. Parameter Estimation for HMMs
6. References

1. DNA Methylation and CpG islands
• CG dinucleotides in the human genome (a C followed by a G on the same strand)
• The cytosine in a CG dinucleotide is typically modified by methylation
• Methyl-C has a high chance of mutating into a T
• As a result, CG dinucleotides are rarer in the genome than the independent frequencies of C and G would predict
• Methylation is suppressed in short stretches of the genome, such as around the promoters or start regions of many genes. These stretches contain more CG dinucleotides and are called CpG islands
• The "p" in CpG indicates that "C" and "G" are connected by a phosphodiester bond
• Two questions:
  – Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island?
  – Given a long piece of sequence, how would we find the CpG islands in it?

2. Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island?
• Markov chain:
  – Transition probabilities:
    $$ a_{st} = P(x_i = t \mid x_{i-1} = s) $$
  – Probability of a sequence:
    $$ P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i} $$
• Beginning and end of sequences: modeled with silent states

Transition probabilities using the maximum likelihood estimator for CpG islands:
• Two Markov chain models:
  1. CpG islands (the '+' model)
  2. Remainder of the sequence (the '-' model)
• Tables of transition probabilities (row: current base, column: next base):

  +      A       C       G       T
  A    0.180   0.274   0.426   0.120
  C    0.171   0.368   0.274   0.188
  G    0.161   0.339   0.375   0.125
  T    0.079   0.355   0.384   0.182

  -      A       C       G       T
  A    0.300   0.205   0.285   0.210
  C    0.322   0.298   0.078   0.302
  G    0.248   0.246   0.298   0.208
  T    0.177   0.239   0.292   0.292

• Each row sums to 1 (up to rounding).
• Tables are asymmetric.
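As a rough sketch of how the two tables above can be applied to a short sequence, the following Python computes the chain probability $P(x)$ under the '+' and '-' models and compares them. The uniform value used for $P(x_1)$ (`p_first=0.25`) and the function name `log2_chain_prob` are assumptions made for illustration; they are not taken from the slides.

```python
import math

BASES = "ACGT"

# Transition probabilities a_st copied from the two tables above
# (key: current base s, list entries: next base t in the order A, C, G, T).
PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}


def log2_chain_prob(seq, table, p_first=0.25):
    """Return log2 P(x) = log2 P(x_1) + sum_{i=2..L} log2 a_{x_{i-1} x_i}.

    p_first is an ASSUMED uniform probability for the first base;
    the slides do not specify P(x_1).
    """
    total = math.log2(p_first)
    for prev, cur in zip(seq, seq[1:]):
        total += math.log2(table[prev][BASES.index(cur)])
    return total


seq = "CGCG"
lp_plus = log2_chain_prob(seq, PLUS)
lp_minus = log2_chain_prob(seq, MINUS)

# A positive difference means the '+' (CpG island) chain explains the
# sequence better than the '-' chain; the assumed P(x_1) cancels out here.
print(f"log2 P(x|+) = {lp_plus:.3f}")
print(f"log2 P(x|-) = {lp_minus:.3f}")
print(f"difference  = {lp_plus - lp_minus:.3f}")
```

Working in log space avoids numerical underflow for long sequences, since the raw product of many probabilities quickly becomes too small to represent.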
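To use this model for discrimination:
• Log-odds ratio:
  $$ S(x) = \log \frac{P(x \mid \text{model } +)}{P(x \mid \text{model } -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i} $$
• $x$ is the sequence (with $x_0$ the silent begin state)
• $\beta$ is the log-likelihood ratio of the corresponding transition probabilities (the values below are in bits, i.e. log base 2):

  β       A        C        G        T
  A    -0.740    0.419    0.580   -0.803
  C    -0.913    0.302    1.812   -0.685
  G    -0.624    0.461    0.331   -0.730
  T    -1.169    0.573    0.339   -0.679

• Figure: histogram of the length-normalized scores $S(x)$ for all the sequences (~60,000 nucleotides)

3. Given a long piece of sequence, how would we find the CpG islands in it?
• Single model for the entire sequence that incorporates both Markov chains: a hidden Markov model (HMM)
• Within each set of states ('+' or '-'), the transition probabilities are similar to those of the corresponding Markov chain
• Small chance of switching between '+' and '-' regions
• There is no one-to-one correspondence between states and symbols: for example, the symbol C can be emitted by state C+ or by state C-

• Sequence of states (path $\pi$): transition probabilities
  $$ a_{st} = P(\pi_i = t \mid \pi_{i-1} = s) $$
  – The state sequence is hidden in an HMM
• Sequence of symbols: emission probabilities
  $$ e_s(b) = P(x_i = b \mid \pi_i = s) $$
  – Probability that symbol $b$ is seen in state $s$
  – In the CpG island model, every emission probability is 0 or 1
• A sequence can be generated from an HMM as follows:
  – A state $\pi_1$ is chosen according to the probabilities $a_{0i}$
  – In $\pi_1$, an observation is emitted according to the distribution $e_{\pi_1}$
  – A new state $\pi_2$ is chosen according to the transition probabilities $a_{\pi_1 i}$
  – And so forth, producing a sequence of random observations
  – $P(x)$ = probability that $x$ was generated by the model
  – Joint probability of an observed sequence $x$ and a state sequence $\pi$:
    $$ P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}, \qquad \pi_{L+1} = 0 $$

• Example: probability of the sequence 'CGCG' being emitted by the state sequence (C+, G-, C-, G+):
  $$ a_{0,C_+} \times 1 \times a_{C_+,G_-} \times 1 \times a_{G_-,C_-} \times 1 \times a_{C_-,G_+} \times 1 \times a_{G_+,0} $$
• The joint probability is not very useful in practice because the path is not known → path estimation:
  – Viterbi algorithm: finds the single most probable path
  – Forward or backward algorithm: sums over all paths to give $P(x)$ and posterior state probabilities
• Example: CpG model generating the symbol sequence CGCG
  – Candidate state sequences: (C+, G+, C+, G+), (C-, G-, C-, G-), (C+, G-, C-, G+)
  – (C+, G-, C-, G+): requires switching back and forth between '+' and '-', which has small probability
  – (C-, G-, C-, G-): small probability of CG in the '-' group
  – (C+, G+, C+, G+): best option!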
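The path comparison for CGCG above can be sketched with a small Viterbi implementation over the eight states A+, C+, G+, T+, A-, C-, G-, T-. The slides do not give the 8x8 transition matrix, so this sketch assumes one: transitions that stay in the same group use that group's table scaled by `1 - p_switch`, and transitions that change group use the destination group's table scaled by `p_switch`. The value `p_switch=0.05`, the uniform start probability, and the omission of an end state are all illustrative assumptions, not the lecture's parameters.

```python
import math

BASES = "ACGT"
PLUS = {"A": [0.180, 0.274, 0.426, 0.120], "C": [0.171, 0.368, 0.274, 0.188],
        "G": [0.161, 0.339, 0.375, 0.125], "T": [0.079, 0.355, 0.384, 0.182]}
MINUS = {"A": [0.300, 0.205, 0.285, 0.210], "C": [0.322, 0.298, 0.078, 0.302],
         "G": [0.248, 0.246, 0.298, 0.208], "T": [0.177, 0.239, 0.292, 0.292]}

# Each state is a (base, group) pair, e.g. ("C", "+").
STATES = [(b, g) for g in "+-" for b in BASES]


def transition(s, t, p_switch):
    """ASSUMED a_st: destination group's table, weighted by stay/switch odds."""
    table = PLUS if t[1] == "+" else MINUS
    base = table[s[0]][BASES.index(t[0])]
    return base * ((1.0 - p_switch) if s[1] == t[1] else p_switch)


def log_emit(state, symbol):
    """Emission probabilities are 0 or 1: state X+/X- emits only symbol X."""
    return 0.0 if state[0] == symbol else float("-inf")


def viterbi(seq, p_switch=0.05):
    """Return (most probable state path, its log joint probability)."""
    start = math.log(1.0 / len(STATES))  # ASSUMED uniform a_0k
    v = {k: start + log_emit(k, seq[0]) for k in STATES}
    pointers = []
    for symbol in seq[1:]:
        new_v, ptr = {}, {}
        for t in STATES:
            # Best predecessor of t among all states.
            best = max(STATES,
                       key=lambda s: v[s] + math.log(transition(s, t, p_switch)))
            ptr[t] = best
            new_v[t] = (v[best] + math.log(transition(best, t, p_switch))
                        + log_emit(t, symbol))
        v = new_v
        pointers.append(ptr)
    # Trace back from the best final state.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return ["".join(state) for state in path], v[last]


path, score = viterbi("CGCG")
print(path, score)
```

Under these assumed parameters the all-'+' path is selected for CGCG, in line with the "best option" noted above; a different `p_switch` or start distribution could change the result for other sequences.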
5. Parameter Estimation for HMMs
Building an HMM involves two steps:
1. Design the structure: the states and their connections
2. Assign the parameter values: the transition and emission probabilities, $a_{st}$ and $e_s(b)$
• Training methods: Baum-Welch and Viterbi training

6. References
• Bandyopadhyay, Sanghamitra. "Gene Identification: Classical and Computational Intelligence Approach". IEEE, vol. 38, Jan. 2008.
• Durbin, R., S. Eddy, and A. Krogh. Biological Sequence Analysis. Cambridge: Cambridge University Press, 1998.
• Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden: Kluwer Academic, 2001.
• Birney, E. "Hidden Markov Models in Biological Sequence Analysis". July 2001.
• Haussler, David, David Kulp, Martin Reese, and Frank Eeckman. "A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA".
• Boufounos, Petros, Sameh El-Difrawy, and Dan Ehrlich. "Hidden Markov Models for DNA Sequencing".