DNA Analysis
Part II
Amir Golnabi
ENGS 112
Spring 2008
What we saw in part I:
1. Markov Chain
2. DNA and Modeling
3. Markovian Models for DNA Sequences
4. HMM for DNA Sequences
Part II:
1. DNA Methylation and CpG islands
2. Markov Chain Model
3. Hidden Markov Model
4. Finding the State Path
5. Parameter Estimation for HMMs
6. References
1. DNA Methylation and CpG islands
• CG base pair in the human genome
• Modification of Cytosine by methylation
• High chance of mutation of methyl-C into a T
• CG dinucleotides are rarer in the genome
• Methylation is suppressed in short stretches of the
genome such as around the promoters or start regions
of many genes → more CG dinucleotides: CpG islands
• "p": "C" and "G" are connected by a phosphodiester
bond
• Two questions:
– Given a short stretch of genomic sequence, how would
we decide whether it comes from a CpG island?
– Given a long piece of sequence, how would we find the
CpG islands in it?
2. Given a short stretch of genomic sequence, how
would we decide whether it comes from a CpG island?
• Markov Chain:
Transition probabilities:
$a_{st} = P(x_i = t \mid x_{i-1} = s)$
Probability of sequences:
$$P(x) = P(x_L, x_{L-1}, \ldots, x_1)$$
$$= P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$$
• Beginning and end of sequences:
> Silent states
Transition probabilities using Maximum likelihood
estimator for CpG islands:
• Two Markov chain models:
1. CpG islands (the ‘+’ model)
2. Remainder of the sequence (the ‘-’ model)
• Tables of transition frequencies (row: current base, column: next base):
+      A      C      G      T
A    0.180  0.274  0.426  0.120
C    0.171  0.368  0.274  0.188
G    0.161  0.339  0.375  0.125
T    0.079  0.355  0.384  0.182

-      A      C      G      T
A    0.300  0.205  0.285  0.210
C    0.322  0.298  0.078  0.302
G    0.248  0.246  0.298  0.208
T    0.177  0.239  0.292  0.292
• Each row sums to 1.
• Tables are asymmetric.
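The maximum likelihood estimate of each transition probability is a count ratio: the number of times base t follows base s, divided by the number of times any base follows s. A minimal sketch of that counting step, assuming the training sequences are plain strings over A, C, G, T; the function name, the optional pseudocount, and the toy sequences are hypothetical and are not the data behind the tables:

```python
ALPHABET = "ACGT"

def estimate_transitions(sequences, pseudocount=0.0):
    """Estimate a_st = c_st / sum over t' of c_st' from observed sequences."""
    # counts[s][t] = number of times base t directly follows base s
    counts = {s: {t: pseudocount for t in ALPHABET} for s in ALPHABET}
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    probs = {}
    for s in ALPHABET:
        total = sum(counts[s].values())
        # A base that never occurs as a predecessor keeps an all-zero row.
        probs[s] = {t: (counts[s][t] / total if total else 0.0) for t in ALPHABET}
    return probs

# Hypothetical toy training set standing in for annotated CpG islands.
toy_island_seqs = ["CGCGGC", "GCGCGA", "ACGCGT"]
a_plus = estimate_transitions(toy_island_seqs)
```

Running the same function on sequences taken from outside the islands would give the ‘-’ table. A small positive pseudocount avoids zero probabilities when the training set is small.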
To use this model for discrimination: log-odds
ratio:
$$S(x) = \log \frac{P(x \mid \text{model } +)}{P(x \mid \text{model } -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$$
• x is the sequence
• β is the log likelihood ratio of corresponding
transition probabilities (in bits, i.e. base-2 logarithms):
β      A       C       G       T
A    -0.740   0.419   0.580  -0.803
C    -0.913   0.302   1.812  -0.685
G    -0.624   0.461   0.331  -0.730
T    -1.169   0.573   0.339  -0.679
[Figure: histogram of the length-normalized scores, S(x), for all the sequences (~60,000 nucleotides)]
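The log-odds score can be computed directly from the two transition tables of this section. The sketch below assumes base-2 logarithms, skips the begin-state term (it sums over consecutive base pairs only), and divides by the sequence length when `normalize=True`; the function name and that normalization choice are illustrative assumptions:

```python
import math

ALPHABET = "ACGT"

# Rows: current base; columns: next base in the order A, C, G, T.
PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}

def log_odds_score(seq, normalize=False):
    """Sum log2(a+_st / a-_st) over consecutive base pairs of seq."""
    score = 0.0
    for s, t in zip(seq, seq[1:]):
        j = ALPHABET.index(t)
        score += math.log2(PLUS[s][j] / MINUS[s][j])
    if normalize and seq:
        score /= len(seq)  # assumed normalization: bits per nucleotide
    return score

cg_rich = log_odds_score("CGCG")   # positive: looks like a CpG island
at_rich = log_odds_score("ATAT")   # negative: looks like background
```

A positive score favors the ‘+’ model and a negative score favors the ‘-’ model. Normalizing by length makes scores of sequences with different lengths comparable.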
3. Given a long piece of sequence, how would we
find the CpG islands in it?
• Single model for the entire sequence that
incorporates both Markov chains: HMM
• Similar transition probabilities within each set
• Small chance of switching between + and – regions
• There is no one-to-one correspondence between states
and symbols.
• Sequence of states (path π): transition probabilities:
– State sequence is hidden in HMM
$a_{st} = P(\pi_i = t \mid \pi_{i-1} = s)$
• Sequence of symbols: emission probabilities:
– Prob. b is seen in state s
$e_s(b) = P(x_i = b \mid \pi_i = s)$
– Emission prob. in the CpG island model: 0 or 1 (each state emits only its own base)
• A sequence can be generated from a HMM as follows:
– A state $\pi_1$ is chosen according to $a_{0i}$
– In $\pi_1$ an observation is emitted according to $e_{\pi_1}$
– A new state $\pi_2$ is chosen according to $a_{\pi_1 i}$
– and so forth…: a sequence of random observations
– P(x) = prob. x was generated by the model
– Joint probability of an observed seq x and state seq π
(with $\pi_{L+1} = 0$, the end state):
$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$
• Example: Prob. of sequence ‘CGCG’ being emitted by the
state sequence (C+,G-,C-,G+):
$$a_{0,C_+} \times 1 \times a_{C_+,G_-} \times 1 \times a_{G_-,C_-} \times 1 \times a_{C_-,G_+} \times 1 \times a_{G_+,0}$$
• Not very useful in practice because the path is not
known → Path estimation: by finding the most likely one
– Viterbi algorithm (most probable path)
– Forward or backward algorithm (P(x) and posterior state probabilities)
• Example: CpG model: Generating symbol sequence CGCG
– State sequences: (C+,G+,C+,G+),(C-,G-,C-,G-),
(C+,G-,C-,G+)
– (C+,G-,C-,G+): switching back and forth between + and –
– (C-,G-,C-,G-): small prob. of CG in ‘-’ group
– (C+,G+,C+,G+): Best option!
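The comparison of the three candidate paths can be checked numerically with a small Viterbi sketch. The slides give no numbers for switching between ‘+’ and ‘-’, so the switching probability `P_SWITCH`, the uniform start over the eight states, and the omission of the end-state term are assumptions made only for this illustration. Because each state emits only its own base, the only unknown at each position is the label ‘+’ or ‘-’:

```python
import math

ALPHABET = "ACGT"
# Transition tables (rows: current base; columns: next base A, C, G, T).
PLUS = {"A": [0.180, 0.274, 0.426, 0.120], "C": [0.171, 0.368, 0.274, 0.188],
        "G": [0.161, 0.339, 0.375, 0.125], "T": [0.079, 0.355, 0.384, 0.182]}
MINUS = {"A": [0.300, 0.205, 0.285, 0.210], "C": [0.322, 0.298, 0.078, 0.302],
         "G": [0.248, 0.246, 0.298, 0.208], "T": [0.177, 0.239, 0.292, 0.292]}
TABLE = {"+": PLUS, "-": MINUS}
P_SWITCH = 0.05               # assumed; the slides only say "small chance"
LOG_START = math.log(1 / 8)   # assumed uniform start over the 8 states

def log_trans(s, ls, t, lt):
    """log transition probability from state (s, ls) to state (t, lt)."""
    weight = 1 - P_SWITCH if ls == lt else P_SWITCH
    return math.log(weight * TABLE[lt][s][ALPHABET.index(t)])

def joint_log_prob(x, labels):
    """log P(x, pi), where state i is base x[i] with label labels[i]."""
    logp = LOG_START
    for i in range(1, len(x)):
        logp += log_trans(x[i - 1], labels[i - 1], x[i], labels[i])
    return logp

def viterbi(x):
    """Most probable label string ('+'/'-' per base) for a non-empty x."""
    v = {"+": LOG_START, "-": LOG_START}
    back = []
    for i in range(1, len(x)):
        new_v, ptr = {}, {}
        for lt in "+-":
            best = max("+-", key=lambda ls: v[ls] + log_trans(x[i - 1], ls, x[i], lt))
            ptr[lt] = best
            new_v[lt] = v[best] + log_trans(x[i - 1], best, x[i], lt)
        v = new_v
        back.append(ptr)
    label = max("+-", key=lambda l: v[l])
    path = [label]
    for ptr in reversed(back):   # trace the back-pointers from the end
        label = ptr[label]
        path.append(label)
    return "".join(reversed(path))

best_path = viterbi("CGCG")
```

Under these assumed parameters, `joint_log_prob("CGCG", "++++")` exceeds the values for `"----"` and `"+--+"`, in line with the ranking on the slide, and `viterbi` returns the all-‘+’ labelling.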
5. Parameter Estimation for HMMs:
HMM models:
1. Design the structure: states and their connections
2. Design parameter values: transition and emission
probabilities, $a_{st}$ and $e_s(b)$
Baum-Welch and Viterbi training
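Both training schemes share the same re-estimation step, sketched below. The count symbols $A_{st}$ and $E_s(b)$ are notation introduced here for illustration: $A_{st}$ is the number of times the transition from s to t is used and $E_s(b)$ is the number of times symbol b is emitted in state s.

```latex
a_{st} = \frac{A_{st}}{\sum_{t'} A_{st'}}
\qquad
e_s(b) = \frac{E_s(b)}{\sum_{b'} E_s(b')}
```

In Viterbi training the counts are taken along the most probable paths under the current parameters. In Baum-Welch they are expected counts computed with the forward and backward algorithms. In both cases the estimate and the counts are recomputed in turn until the parameters stop changing.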
6. References
• Bandyopadhyay, Sanghamitra. Gene Identification: Classical and
Computational Intelligence Approach. 38 vols. IEEE, Jan. 2008.
• Durbin, R., S. Eddy, and A. Krogh. Biological Sequence Analysis.
Cambridge: Cambridge University, 1998.
• Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden:
Kluwer Academic, 2001.
• Birney, E. "Hidden Markov models in biological sequence
analysis". July 2001.
• Haussler, David, David Kulp, Martin Reese, and Frank Eeckman. "A
Generalized Hidden Markov Model for the Recognition of Human Genes
in DNA".
• Boufounos, Petros, Sameh El-Difrawy, and Dan Ehrlich. "Hidden Markov
Models for DNA Sequencing".