Genome evolution: a computational approach
Lecture 1: Modern challenges in evolution. Markov processes.
Amos Tanay, Ziskind 204, ext. 3579
amos.tanay@weizmann.ac.il
http://www.wisdom.weizmann.ac.il/~atanay/GenomeEvo/
The Genome
[Figure: genome structure — intergenic regions, exons, and introns along the DNA sequence (bases A, C, G, T); coding exons are read as a triplet code.]
Humans and Chimps
Two genomes, each ~3×10^9 letters over {A,C,G,T}, separated by ~5-7 million years of evolution.
Genome alignment:
• Where are the “important” differences?
• How did they happen?
• How were new features gained?
[Figure: primate phylogeny — Human, Chimp, Gorilla, Orangutan, Gibbon, Baboon, Macaque, Marmoset — with approximate sequence divergence per branch, ranging from ~0.5% to ~9%.]
Antibiotic resistance: Staphylococcus aureus
Timeline of the evolution of bacterial resistance in an S. aureus patient (Mwangi et al., PNAS 2007)
• Skin-based pathogen
• Killed 19,000 people in the US during 2005 (more than AIDS)
• Resistance to penicillin: 50% in 1950, 80% in 1960, ~98% today
• 2.9 Mb genome, 30 kb plasmid
How do bacteria become resistant to antibiotics?
Can we eliminate resistance by better treatment protocols, given an understanding of the evolutionary process?
Ultimate experiment: sequence the entire genome of the evolving S. aureus.
Mutations and resistance to antibiotics

Resistance levels over the course of treatment:

Date  | Vancomycin | Rifampicin | Oxacillin | Daptomycin
20/7  | 1          | 0.012      | 0.75      | 0.01
20/9  | 4          | 16         | 25        | 0.05
1/10  | 6          | 16         | 0.75      | 0.05
6/10  | 8          | 16         | 1.5       | 1.0
13/10 | 8          | 16         | 0.75      | 1.0

Mutations accumulated along the same timeline (numbered 1, 2, 3, 4-6, 7, 8, 9, 10, 11, 12, 13, 14, 15…18).

S. aureus found just a few “right” mutations and survived multiple antibiotics.
Yeast genome duplication
• The budding yeast S. cerevisiae genome contains extensive duplicates
• We can trace a whole-genome duplication by looking at yeast species that lack the duplicates (K. waltii, A. gossypii)
• Only a small fraction (~5%) of the yeast genome remains duplicated
• How can an organism tolerate genome duplication and massive gene loss?
• Is this critical in evolving new functionality?
“Junk” and ultraconservation
Genome size and gene number versus organismal complexity (from Lynch 2007):
• Baker’s yeast: 12 Mb, ~6,000 genes, 1 cell
• The worm C. elegans: 100 Mb, ~20,000 genes, ~1,000 cells
• Humans: 3 Gb, ~27,000 genes, ~50 trillion cells
[Figure: genome structure — intergenic regions, exons, introns — annotated with ENCODE data.]
Grand unifying theory of everything
Biology (phenotype) ↔ Genomes (genotype)
Genomes are strings of A, C, G, T (total DNA on earth: a lot, but only that much).
Evolution: bird’s-eye view
[Diagram: Species A and Species B diverging through mutation, recombination, and selection acting on fitness.]
• Ecology (many species)
• Geography (communication barriers)
• Environment (changing fitness)
Mathematical toolbox: probability, calculus/matrix theory, some graph theory, some statistics.
Course outline
• Probabilistic models: Markov chains (discrete and continuous time), Bayesian networks, factor graphs
• Inference: dynamic programming, sampling, variational methods, generalized belief propagation
• Parameter estimation: EM, function optimization
• Genome structure: introduction to the human genome
• Mutations: point mutations, insertions/deletions, repeats
• Population: basic population genetics, drift/fitness/selection
• Inferring selection: protein-coding genes, transcription factor binding sites, RNA, networks
Things you need to know or catch up with:
• Graph theory — basic definitions, trees, cycles
• Matrix algebra — basic definitions, eigenvalues
• Probability — basic discrete probability, standard distributions

What you’ll learn:
• Modern methods for inference in complex probabilistic models in general
• Introduction to genome organization and key concepts in evolution
• Inferring selection using comparative genomics
Books:
• Graur and Li, Molecular Evolution
• Lynch, The Origins of Genome Architecture
• Hartl and Clark, Population Genetics
• Durbin et al., Biological Sequence Analysis
• Karlin and Taylor, Markov Processes
• Friedman and Koller, draft textbook on Bayesian networks and beyond (handouts)
• Papers as we go along…
Course duties
• 5 exercises, 40% of the grade
  – Mainly theoretical math questions, usually ~120 points to collect
  – Trade 1 exercise for ppt annotations (extensive in-line notes)
• 1 genomic exercise (in pairs), 10% of the grade
  – Compare two genomes of your choice: mammals, worms, flies, yeasts, bacteria, plants
• Exam: 60%
(110% in total)
[Diagram: a phylogeny with ancestral genome sequences 1 and 2 (hidden), extant genome sequences 1, 2, and 3 (observed), and model parameters on the branches.]

(0) Modeling the genome sequences
Probabilistic modeling: P(data | θ)
Using few parameters to explain/regenerate most of the data.
Hidden variables make the model explicit and mechanistic.

(1) Inferring ancestral genomes
Based on some model, compute the distribution of ancestral genomes.

(2) Learning an evolutionary model
Using extant genomes, learn a “reasonable” model.
[Diagram: the same phylogeny — ancestral genome sequences 1 and 2, extant genome sequences 1, 2, and 3, and shared model parameters.]

(1) Decoding the genome
Genomic regions with different functions evolve differently.
Learn to read the genome through evolutionary modelling.

(2) Understanding the evolutionary process
The model parameters describe evolution.

(3) Inferring phylogenies
Which tree structure explains the data best? Is it a tree?
Probabilities
Our probability space:
• DNA/protein sequences: {A,C,G,T}
• Time/populations

Queries:
• If a locus has an A at time t, what is the chance it will be a C at time t+1?
• If a locus has an A in an individual from population P, what is the chance it will be a C in another individual from the same population?
• What is the chance of finding the motif ACGCGT anywhere in a random individual of population P? What is the chance it will remain the same after 2M years?

Conditional probability:
$$P(\alpha \mid \beta) = \frac{P(\alpha \cap \beta)}{P(\beta)}$$
Chain rule:
$$P(\alpha \cap \beta) = P(\alpha \mid \beta)\,P(\beta)$$
Bayes' rule:
$$P(\alpha \mid \beta) = \frac{P(\beta \mid \alpha)\,P(\alpha)}{P(\beta)}$$
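A minimal numerical sketch of the chain rule and Bayes' rule (all numbers below are made up for illustration):

```python
# Sketch of the chain rule and Bayes' rule with made-up numbers.
# Suppose a locus is 'A' with prior 0.3; the chance of observing a 'C' one step
# later is 0.01 if the locus is 'A' and 0.2 otherwise.
p_A = 0.3
p_C_given_A = 0.01
p_C_given_notA = 0.2

p_A_and_C = p_C_given_A * p_A                            # chain rule
p_C = p_C_given_A * p_A + p_C_given_notA * (1 - p_A)     # total probability
p_A_given_C = p_A_and_C / p_C                            # Bayes' rule
print(round(p_A_given_C, 3))                             # ~0.021
```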
Random Variables & Notation
• Val(X) — set of possible values of RV X
• Upper case letters denote RVs (e.g., X, Y, Z)
• Upper case bold letters denote sets of RVs (e.g., X, Y)
• Lower case letters denote RV values (e.g., x, y, z)
• Lower case bold letters denote RV set values (e.g., x)
Stochastic Processes and Stationary Distributions
[Diagram: a process model evolving in time t versus its stationary model. Examples: the Poisson process and the random walk on the integers, a Markov chain on states {A, B, C, D}, and Brownian motion; discrete time (T = 1, 2, 3, 4, 5) versus continuous time, discrete versus continuous state spaces.]
The Poisson process
Events occur independently in disjoint time intervals.
$X_t$: an r.v. that counts the number of events up to time $t$.

Assume: $p(h) = ah + o(h)$ as $h \to 0$, with $a > 0$, and the probability of two or more events in time $h$ is $o(h)$.

Now, writing $P_m(t) = \Pr\{X_t = m\}$:
$$P_0(t+h) = P_0(t)\,P_0(h) = P_0(t)\,(1 - p(h))$$
$$\frac{P_0(t+h) - P_0(t)}{h} = -P_0(t)\,\frac{p(h)}{h}$$
$$P_0'(t) = -aP_0(t) \;\Rightarrow\; P_0(t) = ce^{-at}$$
The Poisson process
Probability of m events at time t:
$$P_m(t+h) = P_m(t)P_0(h) + P_{m-1}(t)P_1(h) + \sum_{i=2}^{m} P_{m-i}(t)P_i(h)$$
With $P_1(h) = p(h)$:
$$P_m(t+h) - P_m(t) = P_m(t)\,[P_0(h) - 1] + P_{m-1}(t)P_1(h) + o(h) = -P_m(t)\,p(h) + P_{m-1}(t)\,p(h) + o(h)$$
$$\frac{P_m(t+h) - P_m(t)}{h} \;\to\; -aP_m(t) + aP_{m-1}(t) \qquad (h \to 0)$$
$$P_m'(t) = -aP_m(t) + aP_{m-1}(t)$$
The Poisson process
Solving the recurrence: define $Q_m(t) = P_m(t)e^{at}$, so that
$$P_m'(t) = -aP_m(t) + aP_{m-1}(t) \;\Rightarrow\; Q_m'(t) = aQ_{m-1}(t), \qquad Q_0(0) = 1, \; Q_m(0) = 0$$
$$Q_1'(t) = a \;\Rightarrow\; Q_1(t) = at$$
$$Q_2'(t) = a^2 t \;\Rightarrow\; Q_2(t) = \frac{a^2 t^2}{2}$$
$$\vdots$$
$$P_m(t) = \frac{a^m t^m}{m!}\,e^{-at}$$
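A quick sanity check of this result (a minimal sketch; the rate and time horizon below are arbitrary illustration values): simulate the process with exponential waiting times and compare the empirical distribution of $X_t$ with the derived Poisson probabilities.

```python
# Sketch: simulate a Poisson process with rate a and compare the empirical
# distribution of X_t (number of events by time t) with P_m(t) = (at)^m e^{-at}/m!.
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)
a, t, n_runs = 2.0, 3.0, 100_000

counts = np.zeros(n_runs, dtype=int)
for r in range(n_runs):
    s, m = 0.0, 0
    while True:
        s += rng.exponential(1.0 / a)   # waiting time between consecutive events
        if s > t:
            break
        m += 1
    counts[r] = m

for m in range(10):
    empirical = np.mean(counts == m)
    theoretical = (a * t) ** m * exp(-a * t) / factorial(m)
    print(f"m={m}: empirical {empirical:.4f}  theory {theoretical:.4f}")
```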
Markov chains
General stochastic process: $X_s$, with $\Pr(\text{state}, \text{time})$.
The Markov property: for $u < t < s$, knowing $X_t$ makes $X_s$ and $X_u$ independent.
A set of states: finite or countable (e.g., the integers, or {A,C,G,T}).
Discrete time: $T = 0, 1, 2, 3, \ldots$

Transition probability: $P(x, s; t, A) = \Pr(X_t \in A \mid X_s = x)$
Stationary transition probabilities: $P(x, s; t, A) = P(x, t - s, A)$
One-step transitions: $P_{ij} = \Pr(X_{t+1} = j \mid X_t = i)$
Stationary process: $(X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_n+h}) \sim (X_{t_1}, X_{t_2}, \ldots, X_{t_n})$
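A minimal sketch of a discrete-time Markov chain on {A,C,G,T} (the transition matrix below is an arbitrary illustration, not a fitted model):

```python
# Sketch: simulate a discrete-time Markov chain on {A,C,G,T}.
# The transition matrix is an arbitrary illustration, not an estimated model.
import numpy as np

states = ["A", "C", "G", "T"]
P = np.array([
    [0.91, 0.03, 0.04, 0.02],
    [0.02, 0.92, 0.02, 0.04],
    [0.04, 0.02, 0.92, 0.02],
    [0.02, 0.04, 0.03, 0.91],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a conditional distribution

rng = np.random.default_rng(1)
x = 0  # start at 'A'
path = [states[x]]
for _ in range(20):
    x = rng.choice(4, p=P[x])  # the next state depends only on the current one
    path.append(states[x])
print("".join(path))
```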
Markov chains
[Diagram: the “loaded coin” — a hidden chain switching between states A and B with transition probabilities p_ab and p_ba (staying probabilities 1 − p_ab and 1 − p_ba), emitting at each step T = 1, 2, 3, 4 either one of the 4 nucleotides (A, C, G, T) or one of the 20 amino acids (ARNDCEQGHILKMFPSTWYV).]
Markov chains
Transition matrix $P$ (each row is a conditional distribution; the diagonal holds one minus the outgoing probabilities):
$$
P = \begin{pmatrix}
1 - \sum_{i \ne 0} P_{0i} & P_{01} & P_{02} & \cdots & P_{0n} \\
P_{10} & 1 - \sum_{i \ne 1} P_{1i} & P_{12} & \cdots & P_{1n} \\
\vdots & & \ddots & & \vdots \\
P_{n0} & P_{n1} & \cdots & & 1 - \sum_{i \ne n} P_{ni}
\end{pmatrix}
$$
A discrete time Markov chain is completely defined given an initial condition and a probability matrix.
The Markov chain graph G is defined on the states: we connect (a, b) whenever $P_{ab} > 0$.

Matrix power: $(P^{\top})^{T} x$ is the distribution after $T$ time steps given $x$ as an initial condition (equivalently $x P^{T}$ for a row vector $x$).
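A small sketch of propagating an initial distribution for T steps (reusing the illustrative matrix style from above):

```python
# Sketch: distribution after T steps of a discrete-time Markov chain.
# P is an arbitrary illustrative transition matrix over {A,C,G,T}.
import numpy as np

P = np.array([
    [0.91, 0.03, 0.04, 0.02],
    [0.02, 0.92, 0.02, 0.04],
    [0.04, 0.02, 0.92, 0.02],
    [0.02, 0.04, 0.03, 0.91],
])
x0 = np.array([1.0, 0.0, 0.0, 0.0])  # start for sure in state 'A'

T = 50
xT = x0 @ np.linalg.matrix_power(P, T)   # row-vector convention: x_T = x_0 P^T
print(xT, xT.sum())                      # a distribution over {A,C,G,T}
```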
Spectral decomposition
Two steps by hand: starting at A at T=1, at T=2 we reach A with probability $p_{aa}$ and B with probability $p_{ab}$; at T=3 we reach A with probability $p_{aa}p_{aa} + p_{ab}p_{ba}$ and B with probability $p_{aa}p_{ab} + p_{ab}p_{bb}$.

For the transition matrix $P$ defined above, consider right and left eigenvectors:
$$Px = \lambda x, \qquad xP = \lambda x$$
When an eigenbasis $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ exists, we can find right eigenvectors $\phi^{(1)}, \ldots, \phi^{(n)}$ and left eigenvectors $\psi^{(1)}, \ldots, \psi^{(n)}$ with the eigenvalue spectrum $\lambda_1, \ldots, \lambda_n$, which are bi-orthogonal:
$$(\psi^{(i)}, \phi^{(j)}) = \sum_k \psi^{(i)}_k \phi^{(j)}_k = \delta_{ij}$$
and which define the spectral decomposition:
$$P = \Phi \Lambda \Psi = \sum_{k=1}^{n} \lambda_k\, \phi^{(k)} \psi^{(k)}$$
Spectral decomposition
$$P = \Phi \Lambda \Psi$$
$$P^{T} = (\Phi \Lambda \Psi)(\Phi \Lambda \Psi)\cdots(\Phi \Lambda \Psi) = \Phi \Lambda^{T} \Psi \quad (\text{since } \Psi\Phi = I), \qquad
\Lambda^{T} = \begin{pmatrix} \lambda_1^{T} & 0 & \cdots & 0 \\ 0 & \lambda_2^{T} & & 0 \\ \vdots & & \ddots & \\ 0 & 0 & & \lambda_n^{T} \end{pmatrix}$$
To compute transition probabilities:
• Directly: O(|E|)·T ~ O(N²)·T per initial condition, or T matrix multiplications to preprocess for time T.
• Using spectral decomposition: O(spectral pre-process) + 2 matrix multiplications per initial condition.
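A short sketch of this shortcut (illustrative matrix, assuming a full eigenbasis exists), comparing $P^T$ computed directly with the spectral form $\Phi \Lambda^T \Psi$:

```python
# Sketch: compute P^T via eigendecomposition and compare with the direct matrix power.
# P is an arbitrary illustrative transition matrix; we assume it is diagonalizable.
import numpy as np

P = np.array([
    [0.91, 0.03, 0.04, 0.02],
    [0.02, 0.92, 0.02, 0.04],
    [0.04, 0.02, 0.92, 0.02],
    [0.02, 0.04, 0.03, 0.91],
])

lam, Phi = np.linalg.eig(P)      # right eigenvectors as columns of Phi
Psi = np.linalg.inv(Phi)         # rows of Psi are the matching left eigenvectors

T = 100
P_spectral = (Phi * lam**T) @ Psi           # Phi diag(lambda^T) Psi
P_direct = np.linalg.matrix_power(P, T)
print(np.allclose(P_spectral.real, P_direct))   # True (up to numerical error)
```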
Convergence
Spec(P) = P's eigenvalues, $\lambda_1 \ge \lambda_2 \ge \ldots$
$\lambda_1$ = the largest, always equal to 1.
A Markov chain is irreducible if its underlying graph is (strongly) connected. In that case there is a single eigenvalue that equals 1.
What does the left eigenvector corresponding to $\lambda_1$ represent? It is the fixed point of the chain: the distribution $\pi$ for which, at every state, the outgoing flux equals the incoming flux,
$$\text{out} = \sum_{(i,j) \in E} \pi_i P_{ij} = \sum_{(j,i) \in E} \pi_j P_{ji} = \text{in}$$
$\lambda_2$ = the second largest eigenvalue, controlling the rate of convergence of the process.
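A sketch (same illustrative matrix) extracting the stationary distribution as the left eigenvector for $\lambda_1 = 1$, and checking that $|\lambda_2|$ sets the scale of how fast an arbitrary start approaches it:

```python
# Sketch: stationary distribution as the leading left eigenvector, and the role
# of the second eigenvalue in the convergence rate. Illustrative matrix only.
import numpy as np

P = np.array([
    [0.91, 0.03, 0.04, 0.02],
    [0.02, 0.92, 0.02, 0.04],
    [0.04, 0.02, 0.92, 0.02],
    [0.02, 0.04, 0.03, 0.91],
])

lam, V = np.linalg.eig(P.T)           # left eigenvectors of P = right eigenvectors of P^T
order = np.argsort(-np.abs(lam))
pi = np.real(V[:, order[0]])
pi /= pi.sum()                        # normalize to a probability distribution
print("stationary:", pi)

lambda2 = np.abs(lam[order[1]])
x = np.array([1.0, 0.0, 0.0, 0.0])
for T in (10, 50, 100):
    dist = np.abs(x @ np.linalg.matrix_power(P, T) - pi).sum()
    print(f"T={T}: distance {dist:.2e}  vs lambda2^T {lambda2**T:.2e}")
```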
Continuous time
Think of time steps that become smaller and smaller:
$$P(x, s; t, A) = \Pr(X_t \in A \mid X_s = x), \qquad t \in [0, \infty)$$
Conditions on the transition functions (Markov):
$$P_{ij}(t) \ge 0, \qquad \sum_j P_{ij}(t) = 1, \qquad \sum_k P_{ik}(t)\,P_{kj}(h) = P_{ij}(t+h) \quad \forall t, h \ge 0$$
$$\lim_{t \to 0} P_{ij}(t) = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases}$$
Theorem:
$$-P_{ii}'(0) = \lim_{t \to 0} \frac{1 - P_{ii}(t)}{t} = q_{ii} \quad \text{exists (may be infinite)}$$
$$P_{ij}'(0) = \lim_{t \to 0} \frac{P_{ij}(t)}{t} = q_{ij} \quad \text{exists and is finite}$$
Rates and transition probabilities
The process's rate matrix (the diagonal holds minus the total outgoing rate, so rows sum to zero):
$$
Q = \begin{pmatrix}
-\sum_{i \ne 0} q_{0i} & q_{01} & q_{02} & \cdots & q_{0n} \\
q_{10} & -\sum_{i \ne 1} q_{1i} & q_{12} & \cdots & q_{1n} \\
\vdots & & \ddots & & \vdots \\
q_{n0} & q_{n1} & q_{n2} & \cdots & -\sum_{i \ne n} q_{ni}
\end{pmatrix}
$$
Transition differential equations (Kolmogorov backward form):
$$P_{ij}(s+t) - P_{ij}(t) = \sum_k P_{ik}(s)P_{kj}(t) - P_{ij}(t) = \sum_{k \ne i} P_{ik}(s)P_{kj}(t) + [P_{ii}(s) - 1]\,P_{ij}(t)$$
$$s \to 0: \quad P_{ij}'(t) = \sum_{k \ne i} q_{ik}P_{kj}(t) - q_{ii}P_{ij}(t)$$
$$P'(t) = QP(t) \;\Rightarrow\; P(t) = \exp(Qt)$$
Matrix exponential
The differential equation:
$$P'(t) = QP(t) \;\Rightarrow\; P(t) = \exp(Qt)$$
Series solution:
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i$$
$$(\exp(Qt))' = \sum_{i=1}^{\infty} \frac{i}{i!} Q^i t^{i-1} = Q \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i = Q \exp(Qt)$$
The series can be read as summing over paths of different lengths: 1-path, 2-path, 3-path, 4-path, 5-path, …
Computing the matrix exponential
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i$$
If $Q = \Phi \Lambda \Psi$, then
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!} (\Phi \Lambda \Psi)^i t^i = \Phi \left( \sum_{i=0}^{\infty} \frac{1}{i!} \Lambda^i t^i \right) \Psi = \Phi \exp(\Lambda t)\, \Psi$$
$$\exp(\Lambda t) = \begin{pmatrix} e^{\lambda_1 t} & 0 & \cdots & 0 \\ 0 & e^{\lambda_2 t} & & 0 \\ \vdots & & \ddots & \\ 0 & 0 & & e^{\lambda_n t} \end{pmatrix}$$
Computing the matrix exponential
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i$$
Series methods: just take the first k summands
• reasonable when $\|Q\| \le 1$, since $\frac{1}{i!}Q^i \to 0$
• if the terms are converging, you are OK
• can do scaling/squaring: $e^{Q} = \left(e^{Q/m}\right)^{m}$

Eigenvalues/decomposition: $e^{Q} = S\, e^{B}\, S^{-1}$ for $Q = S B S^{-1}$
• good when the matrix is symmetric
• problems when eigenvalues are (nearly) equal
• multiple methods work with other types of B (e.g., triangular)
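Both routes can be sketched in a few lines (the rate matrix Q below is an arbitrary illustration; scipy.linalg.expm, which internally uses a refined scaling-and-squaring method, serves as the reference):

```python
# Sketch: three ways to compute exp(Qt) for an illustrative 4x4 rate matrix Q
# (off-diagonal rates, diagonal = minus the row sums), compared to scipy's expm.
import numpy as np
from scipy.linalg import expm

Q = np.array([
    [-0.9,  0.3,  0.4,  0.2],
    [ 0.2, -0.8,  0.2,  0.4],
    [ 0.4,  0.2, -0.8,  0.2],
    [ 0.2,  0.4,  0.3, -0.9],
])
t = 0.5
P_ref = expm(Q * t)                      # reference implementation

# 1. Truncated series: sum_{i=0}^{k} (Qt)^i / i!
def expm_series(A, k=20):
    result, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for i in range(1, k + 1):
        term = term @ A / i
        result = result + term
    return result

# 2. Scaling and squaring: e^A = (e^{A/2^m})^(2^m)
def expm_scaling_squaring(A, m=8, k=20):
    E = expm_series(A / 2**m, k)
    for _ in range(m):
        E = E @ E
    return E

# 3. Eigendecomposition: e^{Qt} = Phi diag(e^{lambda_i t}) Psi (assumes Q diagonalizable)
lam, Phi = np.linalg.eig(Q)
Psi = np.linalg.inv(Phi)
P_eig = (Phi * np.exp(lam * t)) @ Psi

print(np.allclose(expm_series(Q * t), P_ref),
      np.allclose(expm_scaling_squaring(Q * t), P_ref),
      np.allclose(P_eig.real, P_ref))
print(P_ref.sum(axis=1))                 # rows sum to 1: a proper transition matrix
```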
Modeling: simple case
Pipeline: genome 1 and genome 2 → alignment → statistics → modeling → inference → learning.

Genome 1:  AGCAACAAGTAAGGGAAACTACCCAGAAAA….
Genome 2:  AGCCACATGTAACGGTAATAACGCAGAAAA….

Statistics: from the alignment, collect the 4×4 table of pair counts $n(c_1, c_2)$ over {A,C,G,T}.

Maximum likelihood model:
$$L(\theta \mid D = \{S_1, S_2\}) = \prod_i \Pr(s_1[i], s_2[i] \mid \theta) = \prod_{c_1, c_2} \bigl(\Pr(c_1, c_2)\bigr)^{n(c_1, c_2)} = \prod_{c_1, c_2} \bigl(\Pr(c_1)\exp(Qt)[c_1, c_2]\bigr)^{n(c_1, c_2)}$$
$$LL(\theta, D) = \sum_{c_1, c_2} n(c_1, c_2) \log \Pr(c_1, c_2)$$
$$\arg\max_{\theta} LL(\theta, D)\,?$$
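A small sketch of collecting pair counts from two aligned toy sequences and evaluating $LL(\theta, D)$, where $\theta$ here is simply the table of pair probabilities:

```python
# Sketch: pair counts n(c1,c2) from two aligned sequences and the log-likelihood
# LL(theta, D) = sum_{c1,c2} n(c1,c2) log Pr(c1,c2). Toy sequences for illustration.
import numpy as np

alphabet = "ACGT"
idx = {c: i for i, c in enumerate(alphabet)}

s1 = "AGCAACAAGTAAGGGAAACTACCCAGAAAA"
s2 = "AGCCACATGTAACGGTAATAACGCAGAAAA"

n = np.zeros((4, 4))
for c1, c2 in zip(s1, s2):
    n[idx[c1], idx[c2]] += 1

# As a simple theta, use the maximum-likelihood pair probabilities n/N.
theta = n / n.sum()

mask = n > 0                          # avoid log(0) for unobserved pairs
LL = np.sum(n[mask] * np.log(theta[mask]))
print("pair counts:\n", n)
print("log-likelihood:", LL)
```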
Modeling: simple case
Same pipeline and alignment as above, with pair counts $n(c_1, c_2)$:
$$LL(\theta, D) = \sum_{c_1, c_2} n(c_1, c_2) \log \Pr(c_1, c_2), \qquad \arg\max_{\theta} LL(\theta, D)\,?$$
The maximum is attained at the empirical pair frequencies:
$$\Pr(c_1, c_2) = \frac{n(c_1, c_2)}{N}$$
Matching the conditional frequencies to the transition probabilities, $\exp(Qt)_{ij} = n(i, j)/\sum_j n_{ij} = n_{ij}/n_i$, so (taking t = 1):
$$Q = \log \begin{pmatrix}
n_{11}/n_1 & n_{12}/n_1 & n_{13}/n_1 & n_{14}/n_1 \\
n_{21}/n_2 & n_{22}/n_2 & n_{23}/n_2 & n_{24}/n_2 \\
n_{31}/n_3 & n_{32}/n_3 & n_{33}/n_3 & n_{34}/n_3 \\
n_{41}/n_4 & n_{42}/n_4 & n_{43}/n_4 & n_{44}/n_4
\end{pmatrix}
$$
where $\log$ denotes the matrix logarithm.
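A sketch of this last step on the toy alignment above, using scipy's matrix logarithm (the pseudocount of 0.5 is an arbitrary smoothing choice to keep every conditional probability positive):

```python
# Sketch: estimate Q (at t = 1) as the matrix logarithm of the row-normalized
# pair-count matrix from the toy alignment; a pseudocount (arbitrary choice)
# keeps every conditional probability strictly positive.
import numpy as np
from scipy.linalg import expm, logm

alphabet = "ACGT"
idx = {c: i for i, c in enumerate(alphabet)}
s1 = "AGCAACAAGTAAGGGAAACTACCCAGAAAA"
s2 = "AGCCACATGTAACGGTAATAACGCAGAAAA"

n = np.zeros((4, 4))
for c1, c2 in zip(s1, s2):
    n[idx[c1], idx[c2]] += 1

cond = (n + 0.5) / (n + 0.5).sum(axis=1, keepdims=True)  # rows: Pr(c2 | c1)
Q_hat = logm(cond)        # may come back complex for noisy counts; report the real part

print(np.round(Q_hat.real, 3))
print("rows sum to ~0:", np.allclose(Q_hat.real.sum(axis=1), 0, atol=1e-6))
print("recovers the conditionals:", np.allclose(expm(Q_hat), cond))
```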
Modeling: but is it kosher?
[Diagram: running the process with rates Q for time t and then again for time t′, versus running it once for time t + t′ — do the two give the same transition probabilities?]
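One way to see that a time-homogeneous model composes consistently (illustrative Q as before): $\exp(Qt)\exp(Qt')$ should equal $\exp(Q(t+t'))$.

```python
# Sketch: Chapman-Kolmogorov composability for a continuous-time model,
# expm(Q t) @ expm(Q t') == expm(Q (t + t')). Illustrative Q only.
import numpy as np
from scipy.linalg import expm

Q = np.array([
    [-0.9,  0.3,  0.4,  0.2],
    [ 0.2, -0.8,  0.2,  0.4],
    [ 0.4,  0.2, -0.8,  0.2],
    [ 0.2,  0.4,  0.3, -0.9],
])
t, t_prime = 0.3, 0.7
print(np.allclose(expm(Q * t) @ expm(Q * t_prime), expm(Q * (t + t_prime))))  # True
```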
Symmetric processes
Definition: we call a Markov process symmetric if its rate matrix is symmetric:
$$\forall i, j: \quad Q_{ij} = Q_{ji}$$
What would a symmetric process converge to? (whiteboard/exercise)

Reversing time: compare
$$\Pr(X_t = j \mid X_s = i) \quad \text{vs.} \quad \Pr(X_s = j \mid X_t = i)$$
Reversibility
Time: $t > s$.
Definition: a reversible Markov process is one for which
$$\Pr(X_s = j \mid X_t = i) = \Pr(X_t = j \mid X_s = i)$$
Claim: a Markov process is reversible iff there exist $\pi_i$ such that
$$\pi_i q_{ij} = \pi_j q_{ji}$$
If this holds, we say the process is in detailed balance. (whiteboard/exercise)
Reversibility
Claim: a Markov process is reversible iff we can write
$$q_{ij} = \pi_j s_{ij}$$
where $S$ is a symmetric matrix. (whiteboard/exercise)
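A quick numerical sketch of these two claims ($\pi$ and $S$ below are made-up illustration values): build $q_{ij} = \pi_j s_{ij}$ with $S$ symmetric, complete the diagonal so rows sum to zero, and check detailed balance and stationarity of $\pi$.

```python
# Sketch: construct a reversible rate matrix q_ij = pi_j * s_ij with S symmetric
# (pi and S are made-up illustration values), then verify detailed balance
# pi_i q_ij = pi_j q_ji and that pi is stationary (pi Q = 0).
import numpy as np

pi = np.array([0.4, 0.3, 0.2, 0.1])
S = np.array([
    [0.0, 1.0, 2.0, 0.5],
    [1.0, 0.0, 0.8, 1.2],
    [2.0, 0.8, 0.0, 0.3],
    [0.5, 1.2, 0.3, 0.0],
])  # symmetric, zero diagonal

Q = S * pi[None, :]                     # q_ij = pi_j * s_ij (diagonal is 0 since S's is)
np.fill_diagonal(Q, -Q.sum(axis=1))     # diagonal = minus the total outgoing rate

flux = pi[:, None] * Q                  # flux_ij = pi_i q_ij
off = ~np.eye(4, dtype=bool)
print("detailed balance:", np.allclose(flux[off], flux.T[off]))  # True
print("pi is stationary:", np.allclose(pi @ Q, 0))               # True
```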