DNA Analysis2

DNA Analysis Part II
Amir Golnabi
ENGS 112, Spring 2008

What we saw in Part I:
1. Markov Chain
2. DNA and Modeling
3. Markovian Models for DNA Sequences
4. HMM for DNA Sequences

Part II:
1. DNA Methylation and CpG islands
2. Markov Chain Model
3. Hidden Markov Model
4. Finding the State Path
5. Parameter Estimation for HMMs
6. References

1. DNA Methylation and CpG islands
• CG dinucleotides (a C followed by a G on the same strand) in the human genome
• The cytosine in a CG dinucleotide is typically modified by methylation
• Methyl-C has a high chance of mutating into a T
• As a result, CG dinucleotides are rarer in the genome than the frequencies of C and G alone would suggest
• Methylation is suppressed in short stretches of the genome, such as around the promoters or start regions of many genes → these stretches contain more CG dinucleotides: CpG islands
• The "p" in CpG: "C" and "G" are connected by a phosphodiester bond
• Two questions:
  – Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island?
  – Given a long piece of sequence, how would we find the CpG islands in it?

2. Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island?
• Markov chain:
  Transition probabilities:
  $a_{st} = P(x_i = t \mid x_{i-1} = s)$
  Probability of a sequence:
  $P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1}) \, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1) \, P(x_1) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$
• Beginning and end of sequences: modeled with silent states

Transition probabilities using the maximum likelihood estimator for CpG islands:
• Two Markov chain models:
  1. CpG islands (the ‘+’ model)
  2. Remainder of the sequence (the ‘-’ model)
• Tables of frequencies (row: previous nucleotide $x_{i-1}$, column: next nucleotide $x_i$):

  +     A      C      G      T
  A   0.180  0.274  0.426  0.120
  C   0.171  0.368  0.274  0.188
  G   0.161  0.339  0.375  0.125
  T   0.079  0.355  0.384  0.182

  -     A      C      G      T
  A   0.300  0.205  0.285  0.210
  C   0.322  0.298  0.078  0.302
  G   0.248  0.246  0.298  0.208
  T   0.177  0.239  0.292  0.292

• Each row sums to 1 (up to rounding).
• Tables are asymmetric.
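The sequence-probability formula and the two tables above can be tried out with a minimal Python sketch. The names `PLUS`, `MINUS`, `chain_log_prob` and `more_likely_model` are illustrative and not from the slides. Because the slides give no initial distribution, the sketch assumes P(x_1) is the same under both models and leaves it out, so it only compares the transition part of P(x).

```python
import math

# Transition tables copied from the slide.
# Outer key: previous nucleotide x_{i-1}; inner key: next nucleotide x_i.
PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}


def chain_log_prob(seq, table):
    """Sum of log a_{x_{i-1} x_i} for i = 2..L (the P(x_1) factor is omitted)."""
    return sum(math.log(table[prev][cur]) for prev, cur in zip(seq, seq[1:]))


def more_likely_model(seq):
    """Return '+' if the CpG-island chain explains seq better, else '-'."""
    plus = chain_log_prob(seq, PLUS)
    minus = chain_log_prob(seq, MINUS)
    return "+" if plus > minus else "-"


if __name__ == "__main__":
    for s in ("CGCG", "ATAT"):
        print(s, more_likely_model(s))
```

Working in log space avoids numerical underflow when the product runs over long sequences.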
To use this model for discrimination:
Log-odds ratio:
$S(x) = \log \frac{P(x \mid \text{model } +)}{P(x \mid \text{model } -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$
• x is the sequence
• β is the log likelihood ratio of the corresponding transition probabilities (in bits, i.e. log base 2):

  β      A       C       G       T
  A   -0.740   0.419   0.580  -0.803
  C   -0.913   0.302   1.812  -0.685
  G   -0.624   0.461   0.331  -0.730
  T   -1.169   0.573   0.339  -0.679

[Figure: histogram of the length-normalized scores S(x) for all the sequences (~60,000 nucleotides)]

3. Given a long piece of sequence, how would we find the CpG islands in it?
• Single model for the entire sequence that incorporates both Markov chains: HMM
• Within each set of states (‘+’ or ‘-’), the transition probabilities are close to those of the original Markov chain
• Small chance of switching between ‘+’ and ‘-’ regions
• There is no one-to-one correspondence between states and symbols: the symbol C, for example, can come from state C+ or from state C-

• Sequence of states (path π):
  – The state sequence is hidden in an HMM
  – Transition probabilities: $a_{st} = P(\pi_i = t \mid \pi_{i-1} = s)$
• Sequence of symbols: emission probabilities:
  – Probability that symbol b is seen in state s: $e_s(b) = P(x_i = b \mid \pi_i = s)$
  – In the CpG island model the emission probabilities are all 0 or 1
• A sequence can be generated from an HMM as follows:
  – A state $\pi_1$ is chosen according to $a_{0 \pi_1}$
  – In $\pi_1$ an observation is emitted according to $e_{\pi_1}$
  – A new state $\pi_2$ is chosen according to $a_{\pi_1 i}$
  – and so forth: a sequence of random observations
  – P(x) = probability that x was generated by the model
  – Joint probability of an observed sequence x and a state sequence π:
    $P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}$, where $\pi_{L+1} = 0$ is the end state

• Example: probability of the sequence ‘CGCG’ being emitted by the state sequence (C+, G-, C-, G+):
  $a_{0,C_+} \times 1 \times a_{C_+,G_-} \times 1 \times a_{G_-,C_-} \times 1 \times a_{C_-,G_+} \times 1 \times a_{G_+,0}$
• Not very useful in practice because the path is not known → Path estimation: by finding the most likely one
  – Viterbi algorithm (most probable path)
  – Forward or backward algorithm (probability of x summed over all paths)
• Example: CpG model generating the symbol sequence CGCG
  – State sequences: (C+,G+,C+,G+), (C-,G-,C-,G-), (C+,G-,C-,G+)
  – (C+,G-,C-,G+): switches back and forth between ‘+’ and ‘-’, which is unlikely
  – (C-,G-,C-,G-): small probability of CG in the ‘-’ group
  – (C+,G+,C+,G+): best option!

5. Parameter Estimation for HMMs
HMM models:
1. Design the structure: states and their connections
2. Design parameter values: transition and emission probabilities, $a_{st}$ and $e_s(b)$
→ Baum-Welch and Viterbi training

7. References
• Bandyopadhyay, Sanghamitra. Gene Identification: Classical and Computational Intelligence Approach. 38 vols. IEEE, Jan. 2008.
• Durbin, R., S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Cambridge: Cambridge University Press, 1998.
• Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden: Kluwer Academic, 2001.
• Birney, E. "Hidden Markov models in biological sequence analysis". July 2001.
• Haussler, David, David Kulp, Martin Reese, and Frank Eeckman. "A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA".
• Boufounos, Petros, Sameh El-Difrawy, and Dan Ehrlich. "Hidden Markov Models for DNA Sequencing".
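As a worked illustration of the path-estimation step, the Viterbi algorithm can be sketched in Python for the eight-state CpG model (A+, C+, G+, T+, A-, C-, G-, T-). The slides do not give the switching probabilities or a start distribution, so the sketch makes three labeled assumptions: a hypothetical parameter `p_switch` for leaving the current set, a cross-set transition that reuses the target set's table scaled by `p_switch`, and a uniform start over all states with no end state. The function names are illustrative.

```python
import math

# Transition tables from the Markov chain slide (row: previous, column: next).
PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}
TABLES = {"+": PLUS, "-": MINUS}
N_STATES = 8  # four nucleotides times two sets


def log_trans(prev, cur, p_switch):
    """log a_{prev,cur}; states are (nucleotide, sign) pairs.

    Assumption (not from the slides): staying in a set costs (1 - p_switch),
    switching costs p_switch, and the nucleotide-to-nucleotide part always
    comes from the table of the set being entered.
    """
    (prev_base, prev_sign), (cur_base, cur_sign) = prev, cur
    set_factor = 1.0 - p_switch if prev_sign == cur_sign else p_switch
    return math.log(set_factor * TABLES[cur_sign][prev_base][cur_base])


def viterbi(seq, p_switch=0.01):
    """Return the most probable '+'/'-' labeling of seq as a string."""
    # Emissions are 0 or 1, so at each position only the two states whose
    # nucleotide matches the observed symbol can have nonzero probability.
    score = {(seq[0], sign): math.log(1.0 / N_STATES) for sign in "+-"}
    backpointers = []
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for sign in "+-":
            cur = (symbol, sign)
            best_prev = max(
                score, key=lambda prev: score[prev] + log_trans(prev, cur, p_switch)
            )
            new_score[cur] = score[best_prev] + log_trans(best_prev, cur, p_switch)
            pointers[cur] = best_prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(score, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return "".join(sign for _, sign in path)


if __name__ == "__main__":
    print(viterbi("CGCG"))
```

With these assumed parameters the CGCG example decodes to the all-‘+’ path, in line with the "best option" named in the example. A different `p_switch` or start distribution can change the decoded path for other sequences.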
