Genome evolution: a computational approach
Lecture 1: Modern challenges in evolution. Markov processes.
Amos Tanay, Ziskind 204, ext. 3579
amos.tanay@weizmann.ac.il
http://www.wisdom.weizmann.ac.il/~atanay/GenomeEvo/

The Genome
[Figure: genome organization - intergenic regions alternating with genes; genes are built of exons and introns, and exons carry the triplet code.]

Humans and Chimps
~5-7 million years of divergence separate two genomes of $3 \times 10^9$ {A,C,G,T} characters each.
Genome alignment:
• Where are the “important” differences?
• How did they happen?

[Figure: primate phylogeny (Human, Chimp, Gorilla, Orangutan, Gibbon, Baboon, Macaque, Marmoset) annotated with branch divergences ranging from 0.5% to 9%.]
Where are the “important” differences? How were new features gained?

Antibiotic resistance: Staphylococcus aureus
Timeline for the evolution of bacterial resistance in an S. aureus patient (Mwangi et al., PNAS 2007)
• Skin-based
• Killed 19,000 people in the US during 2005 (more than AIDS)
• Resistance to penicillin: 50% in 1950, 80% in 1960, ~98% today
• 2.9MB genome, 30KB plasmid
How do bacteria become resistant to antibiotics? Can we eliminate resistance through better treatment protocols, given an understanding of the evolutionary process?
The ultimate experiment: sequence the entire genome of the evolving S. aureus.

Resistance to antibiotics over the course of treatment, as mutations (numbered 1, 2, 3, 4-6, 7, 8, 9, 10, 11, 12, 13, 14, 15…18) accumulate:

Date    Vancomycin  Rifampicin  Oxacillin  Daptomycin
20/7    1           0.012       0.75       0.01
20/9    4           16          25         0.05
1/10    6           16          0.75       0.05
6/10    8           16          1.5        1.0
13/10   8           16          0.75       1.0

S. aureus found just a few “right” mutations and survived multiple antibiotics.

Yeast genome duplication
• The budding yeast S. cerevisiae genome contains extensive duplicates
• We can trace a whole-genome duplication by looking at yeast species that lack the duplicates (K. waltii, A. gossypii)
• Only a small fraction (5%) of the yeast genome remains duplicated
• How can an organism tolerate genome duplication and massive gene loss?
• Is this critical in evolving new functionality?
“Junk” and ultraconservation
Genome size and gene counts across organisms (from Lynch 2007):

Organism              Genome  Genes    Cells
Baker's yeast         12MB    ~6,000   1 cell
The worm C. elegans   100MB   ~20,000  ~1,000 cells
Humans                3GB     ~27,000  ~50 trillion cells

[Figure: ENCODE data along a genomic locus - intergenic regions, exons, and introns.]

Grand unifying theory of everything
Biology (phenotype) <-> Genomes (genotype): strings of A,C,G,T.
(Total DNA on earth: a lot, but only that much.)

Evolution: a bird's-eye view
[Figure: species A and species B diverging under mutation, recombination, and selection.]
Fitness is shaped by:
• Ecology (many species)
• Geography (communication barriers)
• Environment (changing fitness)

Course outline
(Mathematical toolbox: probability, calculus/matrix theory, some graph theory, some statistics)
Models:
• Markov chains (discrete and continuous)
• Bayesian networks
• Factor graphs
Inference:
• Dynamic programming
• Sampling
• Variational methods
• Generalized belief propagation
Topics:
• Genome structure: introduction to the human genome
• Mutations: point mutations, insertions/deletions, repeats
• Population: basic population genetics, drift/fitness/selection
• Parameter estimation: EM, function optimization
• Inferring selection: protein-coding genes, transcription factor binding sites, RNA, networks

Things you need to know or catch up with:
• Graph theory - basic definitions, trees, cycles
• Matrix algebra - basic definitions, eigenvalues
• Probability - basic discrete probability, standard distributions

What you'll learn:
• Modern methods for inference in complex probabilistic models in general
• Intro to genome organization and key concepts in evolution
• Inferring selection using comparative genomics

Books:
• Graur and Li, Molecular Evolution
• Lynch, The Origins of Genome Architecture
• Hartl and Clark, Population Genetics
• Durbin et al., Biological Sequence Analysis
• Karlin and Taylor, Markov Processes
• Friedman and Koller, draft textbook on Bayesian networks and beyond (handouts)
• Papers as we go along...

Course duties
• 5 exercises, 40% of the grade
  - Mainly theoretical math questions, usually ~120 points to collect
  - Trade 1 exercise for ppt annotations (extensive in-line notes)
• 1 genomic exercise (in pairs) for 10% of the grade
  - Compare two genomes of your choice: mammals, worms, flies, yeasts, bacteria, plants
• Exam: 60% (110% in total)

The modeling framework
[Figure: a phylogeny relating ancestral genome sequences 1 and 2 to extant genome sequences 1-3, with model parameters attached to the branches.]
(0) Modeling the genome sequences
• Probabilistic modeling: $P(data \mid \theta)$
• Using few parameters to explain/regenerate most of the data
• Hidden variables make the model explicit and mechanistic
(1) Inferring ancestral genomes
• Based on some model, compute the distribution of ancestral genomes
(2) Learning an evolutionary model
• Using extant genomes, learn a “reasonable” model
The same framework supports:
(1) Decoding the genome
• Genomic regions with different functions evolve differently
• Learn to read the genome through evolutionary modeling
(2) Understanding the evolutionary process
• The model parameters describe evolution
(3) Inferring phylogenies
• Which tree structure explains the data best? Is it a tree?

Probabilities
• Our probability space:
  - DNA/protein sequences: {A,C,G,T}
  - Time/populations
• Queries (a minimal computational sketch follows below):
  - If a locus has an A at time t, what is the chance it will be a C at time t+1?
  - If a locus has an A in an individual from population P, what is the chance it will be a C in another individual from the same population?
  - What is the chance of finding the motif ACGCGT anywhere in a random individual of population P? What is the chance it will remain the same after 2M years?
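Before formalizing, here is a minimal Python sketch of the first and last queries. Everything in it is hypothetical: the transition matrix values and the uniform background model are illustrative choices, not estimates from data.

```python
import numpy as np

# Hypothetical one-step transition matrix over {A, C, G, T}
# (row = base at time t, column = base at time t+1; rows sum to 1).
P = np.array([
    [0.97, 0.01, 0.01, 0.01],   # from A
    [0.01, 0.97, 0.01, 0.01],   # from C
    [0.01, 0.01, 0.97, 0.01],   # from G
    [0.01, 0.01, 0.01, 0.97],   # from T
])
idx = {b: i for i, b in enumerate("ACGT")}

# Query 1: a locus has an A at time t; chance it is a C at time t+1.
print("P(A at t -> C at t+1):", P[idx["A"], idx["C"]])

# Query 3, background part: chance that a fixed position spells ACGCGT
# under an i.i.d. uniform background model.
motif = "ACGCGT"
print("P(motif at a fixed position):", 0.25 ** len(motif))

# Query 3, conservation part: chance the motif reads the same after T
# steps, if sites evolve independently (includes mutate-and-revert paths).
T = 10
PT = np.linalg.matrix_power(P, T)
print("P(motif unchanged after T steps):",
      np.prod([PT[idx[b], idx[b]] for b in motif]))
```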
Conditional probability
[Figure: Venn diagrams of events A and B illustrating $P(A)$, $P(B)$, and $P(A \mid B)$.]
Conditional probability: $P(A \mid B) = \frac{P(A, B)}{P(B)}$
Chain rule: $P(A, B) = P(A \mid B) \, P(B)$
Bayes rule: $P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$

Random variables & notation
• Val(X) - set of possible values of RV X
• Upper-case letters denote RVs (e.g., X, Y, Z)
• Upper-case bold letters denote sets of RVs (e.g., X, Y)
• Lower-case letters denote RV values (e.g., x, y, z)
• Lower-case bold letters denote RV set values (e.g., x)

Stochastic processes and stationary distributions
[Figure: process models and their stationary models - a Poisson process, a random walk, a Markov chain on states {A,B,C,D}, and Brownian motion - in discrete time (T=1,...,5) and continuous time.]

The Poisson process
Events occur independently in disjoint time intervals. Let $X_t$ be an r.v. counting the number of events up to time $t$. Assume the probability of one event in a short interval $h$ is
$$p(h) = ah + o(h), \quad h \to 0, \; a > 0,$$
and the probability of two or more events in time $h$ is $o(h)$. Write $P_m(t) = \Pr\{X_t = m\}$. For zero events:
$$P_0(t+h) = P_0(t)\,P_0(h) = P_0(t)\,(1 - p(h))$$
$$\frac{P_0(t+h) - P_0(t)}{h} = -\frac{p(h)}{h}\,P_0(t) \;\xrightarrow{h \to 0}\; P_0'(t) = -a P_0(t) \implies P_0(t) = c\,e^{-at}$$

Probability of m events at time t:
$$P_m(t+h) = P_m(t)\,P_0(h) + P_{m-1}(t)\,P_1(h) + \sum_{i=2}^{m} P_{m-i}(t)\,P_i(h), \qquad P_1(h) = p(h) + o(h)$$
$$P_m(t+h) - P_m(t) = P_m(t)\,[P_0(h) - 1] + P_{m-1}(t)\,P_1(h) + o(h) = -P_m(t)\,p(h) + P_{m-1}(t)\,p(h) + o(h)$$
$$\frac{P_m(t+h) - P_m(t)}{h} \;\xrightarrow{h \to 0}\; P_m'(t) = -a P_m(t) + a P_{m-1}(t)$$

Solving the recurrence with $Q_m(t) = P_m(t)\,e^{at}$:
$$P_m'(t) = -a P_m(t) + a P_{m-1}(t) \implies Q_m'(t) = a\,Q_{m-1}(t), \qquad Q_0(0) = 1, \; Q_m(0) = 0$$
$$Q_1'(t) = a \implies Q_1(t) = at, \qquad Q_2'(t) = a^2 t \implies Q_2(t) = \frac{a^2 t^2}{2}, \;\ldots$$
$$P_m(t) = e^{-at}\,\frac{(at)^m}{m!}$$

Markov chains
A general stochastic process $(X_s)$ assigns probabilities to (state, time) pairs.
The Markov property: for $u < t < s$, knowing $X_t$ makes $X_s$ and $X_u$ independent.
A set of states: finite or countable (e.g., the integers, or {A,C,G,T}).
Discrete time: $T = 0, 1, 2, 3, \ldots$
Transition probability: $P(x, s; t, A) = \Pr(X_t \in A \mid X_s = x)$
Stationary transition probabilities: $P(x, s; t, A) = P(x, t - s, A)$
One-step transitions: $P_{ij} = \Pr(X_{t+1} = j \mid X_t = i)$
Stationary process: $(X_{t_1 + h}, X_{t_2 + h}, \ldots, X_{t_n + h}) \sim (X_{t_1}, X_{t_2}, \ldots, X_{t_n})$

[Figure: Markov chain examples - the loaded coin (two states A, B with transition probabilities $p_{ab}$, $1 - p_{ab}$, $p_{ba}$, $1 - p_{ba}$), a 4-state chain over nucleotides unrolled for T=1,...,4, and a 20x20 transition matrix over the amino acids ARNDCEQGHILKMFPSTWYV.]

Markov chains: the transition matrix
$$P = \begin{pmatrix}
1 - \sum_{i \neq 0} P_{0i} & P_{01} & P_{02} & \cdots & P_{0n} \\
P_{10} & 1 - \sum_{i \neq 1} P_{1i} & P_{12} & \cdots & P_{1n} \\
\vdots & & \ddots & & \vdots \\
P_{n0} & P_{n1} & P_{n2} & \cdots & 1 - \sum_{i \neq n} P_{ni}
\end{pmatrix}$$
A discrete-time Markov chain is completely defined given an initial condition and a probability matrix.
The Markov chain graph G is defined on the states: we connect (a,b) whenever $P_{ab} > 0$.
Matrix power: $x P^T$ is the distribution after $T$ time steps given the (row) distribution $x$ as an initial condition.
[Figure: unrolling two steps of a two-state chain - e.g., $\Pr(A \to A \text{ in two steps}) = p_{aa} p_{aa} + p_{ab} p_{ba}$.]

Spectral decomposition
Right and left eigenvectors: $P\phi = \lambda\phi$, $\psi P = \lambda\psi$.
When an eigenbasis $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ exists, we can find right eigenvectors $\phi^{(1)}, \ldots, \phi^{(n)}$ and left eigenvectors $\psi^{(1)}, \ldots, \psi^{(n)}$ with the eigenvalue spectrum $\lambda_1, \ldots, \lambda_n$, which are bi-orthogonal:
$$(\phi^{(i)}, \psi^{(j)}) = \sum_k \phi^{(i)}_k \psi^{(j)}_k = \delta_{ij}$$
These define the spectral decomposition:
$$P = \sum_k \lambda_k \phi^{(k)} \psi^{(k)}, \qquad P^T = \Phi \begin{pmatrix} \lambda_1^T & & \\ & \ddots & \\ & & \lambda_n^T \end{pmatrix} \Psi$$
To compute transition probabilities naively: $O(|E|) \cdot T \approx O(N^2) \cdot T$ per initial condition, or $T$ matrix multiplications to preprocess for time $T$. Using the spectral decomposition: one spectral preprocessing step plus 2 matrix multiplications per condition (see the sketch below).
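The following minimal sketch (the 4-state matrix is an arbitrary toy) demonstrates the spectral shortcut: diagonalize P once, obtain $P^T$ by powering only the eigenvalues, and read the stationary distribution off the left eigenvector of $\lambda_1 = 1$, anticipating the convergence discussion that follows.

```python
import numpy as np

# A toy 4-state transition matrix over {A, C, G, T}; rows sum to 1.
P = np.array([
    [0.90, 0.04, 0.04, 0.02],
    [0.03, 0.90, 0.02, 0.05],
    [0.05, 0.02, 0.90, 0.03],
    [0.02, 0.05, 0.03, 0.90],
])

# Spectral preprocessing: P = Phi diag(lam) Phi^{-1}
# (columns of Phi are right eigenvectors; rows of Phi^{-1} are left ones).
lam, Phi = np.linalg.eig(P)
Phi_inv = np.linalg.inv(Phi)

def P_power(T):
    """P^T via the spectral decomposition: power only the eigenvalues."""
    return (Phi * lam**T) @ Phi_inv

T = 50
assert np.allclose(P_power(T).real, np.linalg.matrix_power(P, T))

# The left eigenvector of the eigenvalue 1 is the stationary distribution.
k = np.argmin(np.abs(lam - 1.0))
pi = Phi_inv[k].real
pi /= pi.sum()
print("stationary pi:", pi)
print("a row of P^1000 converges to pi:", P_power(1000).real[0])
```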
Convergence
$\mathrm{Spec}(P)$ = P's eigenvalues, $|\lambda_1| \geq |\lambda_2| \geq \ldots$; $\lambda_1$ = largest, always = 1.
A Markov chain is irreducible if its underlying graph is connected. In that case there is a single eigenvalue that equals 1.
What does the left eigenvector corresponding to $\lambda_1$ represent? It is a fixed point of the process - at each state, the probability flow out equals the flow in:
$$\text{out: } \sum_{j:(i,j) \in E} \pi_i P_{ij} \;=\; \sum_{j:(j,i) \in E} \pi_j P_{ji} \text{ : in}$$
$\lambda_2$ = second-largest eigenvalue, controlling the rate of the process's convergence.

Continuous time
Think of time steps that become smaller and smaller:
$$P(x, s; t, A) = \Pr(X_t \in A \mid X_s = x), \qquad t \in [0, \infty)$$
Conditions on the transitions (Markov):
$$P_{ij}(t) \geq 0, \qquad \sum_j P_{ij}(t) = 1, \qquad P_{ij}(t+h) = \sum_k P_{ik}(t)\,P_{kj}(h) \quad \forall t, h \geq 0$$
$$\lim_{t \to 0^+} P_{ij}(t) = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$
Theorem (Kolmogorov):
$$P_{ii}'(0) = \lim_{t \to 0} \frac{P_{ii}(t) - 1}{t} = q_{ii} \quad \text{exists (may be infinite)}$$
$$P_{ij}'(0) = \lim_{t \to 0} \frac{P_{ij}(t)}{t} = q_{ij} \quad \text{exists and is finite} \quad (i \neq j)$$

Rates and transition probabilities
The process's rate matrix, with $q_{ii} = -\sum_{j \neq i} q_{ij}$ on the diagonal:
$$Q = \begin{pmatrix}
-\sum_{i \neq 0} q_{0i} & q_{01} & q_{02} & \cdots & q_{0n} \\
q_{10} & -\sum_{i \neq 1} q_{1i} & q_{12} & \cdots & q_{1n} \\
\vdots & & \ddots & & \vdots \\
q_{n0} & q_{n1} & q_{n2} & \cdots & -\sum_{i \neq n} q_{ni}
\end{pmatrix}$$
Transition differential equations (backward form):
$$P_{ij}(s+t) - P_{ij}(t) = \sum_k P_{ik}(s)\,P_{kj}(t) - P_{ij}(t) = \sum_{k \neq i} P_{ik}(s)\,P_{kj}(t) + [P_{ii}(s) - 1]\,P_{ij}(t)$$
Dividing by $s$ and letting $s \to 0$:
$$P_{ij}'(t) = \sum_{k \neq i} q_{ik}\,P_{kj}(t) + q_{ii}\,P_{ij}(t)$$
$$P'(t) = Q\,P(t) \implies P(t) = \exp(Qt)$$

Matrix exponential
The differential equation $P'(t) = Q\,P(t)$ has the series solution:
$$P(t) = \exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!}\,Q^i t^i$$
$$(\exp(Qt))' = \sum_{i=1}^{\infty} \frac{1}{(i-1)!}\,Q^i t^{i-1} = Q \exp(Qt)$$
[Figure: the series sums over path lengths - 1-paths, 2-paths, 3-paths, 4-paths, 5-paths, ...]

Computing the matrix exponential
Via an eigendecomposition $Q = S \Lambda S^{-1}$:
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!}\,Q^i t^i = S \left( \sum_{i=0}^{\infty} \frac{1}{i!}\,\Lambda^i t^i \right) S^{-1} = S \exp(\Lambda t)\,S^{-1}, \qquad \exp(\Lambda t) = \begin{pmatrix} e^{\lambda_1 t} & & 0 \\ & \ddots & \\ 0 & & e^{\lambda_n t} \end{pmatrix}$$
Series methods: just take the first k summands
• Reasonable when $\|Q\| \leq 1$; if the terms are converging, you are OK
• Can do scaling/squaring: $e^Q = \left( e^{Q/m} \right)^m$
Eigenvalues/decomposition:
• Good when the matrix is symmetric; problems when there are similar eigenvalues
Multiple methods with other types of $B$ (e.g., triangular): $e^Q = S\,e^B\,S^{-1}$

Modeling: simple case
Modeling -> inference -> learning, starting from a pairwise alignment:
Genome 1: AGCAACAAGTAAGGGAAACTACCCAGAAAA....
Genome 2: AGCCACATGTAACGGTAATAACGCAGAAAA....
Statistics: count $n(c_1, c_2)$, the number of aligned positions carrying $c_1$ in genome 1 and $c_2$ in genome 2 (a 4x4 table over {A,C,G,T}).
Maximum likelihood model:
$$L(\theta \mid D = \{S_1, S_2\}) = \prod_i \Pr(s_1[i], s_2[i] \mid \theta) = \prod_{c_1, c_2} \Pr(c_1, c_2)^{n(c_1, c_2)} = \prod_{c_1, c_2} \left( \Pr(c_1)\,\exp(Qt)[c_1, c_2] \right)^{n(c_1, c_2)}$$
$$LL(\theta, D) = \sum_{c_1, c_2} n(c_1, c_2) \log \Pr(c_1, c_2), \qquad \arg\max_\theta LL(\theta, D) = ?$$
The empirical frequencies maximize the likelihood:
$$\Pr(c_1, c_2) = \frac{n(c_1, c_2)}{N}, \qquad \exp(Qt)_{ij} = \frac{n(i,j)}{n_i}, \quad n_i = \sum_j n(i,j)$$
So, taking $t = 1$:
$$Q = \log \begin{pmatrix}
n_{11}/n_1 & n_{12}/n_1 & n_{13}/n_1 & n_{14}/n_1 \\
n_{21}/n_2 & n_{22}/n_2 & n_{23}/n_2 & n_{24}/n_2 \\
n_{31}/n_3 & n_{32}/n_3 & n_{33}/n_3 & n_{34}/n_3 \\
n_{41}/n_4 & n_{42}/n_4 & n_{43}/n_4 & n_{44}/n_4
\end{pmatrix}$$

Modeling: but is it kosher?
[Figure: composing branches - does a model estimated as (Q, t) followed by (Q, t') agree with one estimated as (Q, t + t')?]
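A minimal sketch of this estimation recipe, with made-up alignment counts (the numbers below are hypothetical, not from real genomes): row-normalize the count table, take its matrix logarithm to get Q at $t = 1$, and check both the reconstruction and the branch-composition property that the “is it kosher?” question probes.

```python
import numpy as np
from scipy.linalg import expm, logm

# Hypothetical pairwise alignment counts n(c1, c2) over {A, C, G, T}:
# rows = base in genome 1, columns = aligned base in genome 2.
n = np.array([
    [9000,  300,  500,  200],
    [ 280, 9100,  220,  400],
    [ 520,  230, 9050,  200],
    [ 190,  410,  210, 9190],
], dtype=float)

# Conditional probabilities: M[i, j] = n(i, j) / n_i ~ exp(Qt)[i, j].
M = n / n.sum(axis=1, keepdims=True)

# Setting t = 1, estimate the rate matrix as the matrix logarithm of M.
Q = logm(M).real
print("estimated Q:")
print(np.round(Q, 4))
print("row sums (should be ~0):", np.round(Q.sum(axis=1), 10))

# Sanity checks: exponentiating Q recovers M...
assert np.allclose(expm(Q), M)
# ...and two branches compose: exp(Q*(t+t')) == exp(Qt) @ exp(Qt').
assert np.allclose(expm(2 * Q), expm(Q) @ expm(Q))
```

Whether such a Q is “kosher” - for instance, whether its off-diagonal entries come out non-negative - is exactly what the question above is probing.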
Symmetric processes
Definition: we call a Markov process symmetric if its rate matrix is symmetric:
$$\forall i, j: \quad Q_{ij} = Q_{ji}$$
What would a symmetric process converge to? (whiteboard/exercise)

Reversibility
Reversing time ($t > s$): compare $\Pr(X_t = j \mid X_s = i)$ with $\Pr(X_s = j \mid X_t = i)$.
Definition: a reversible Markov process is one for which:
$$\Pr(X_s = j \mid X_t = i) = \Pr(X_t = j \mid X_s = i)$$
Claim: a Markov process is reversible iff there exist $\pi_i$ such that:
$$\pi_i q_{ij} = \pi_j q_{ji}$$
If this holds, we say the process is in detailed balance. (whiteboard/exercise)

Claim: a Markov process is reversible iff we can write:
$$q_{ij} = \pi_j s_{ij}$$
where $S$ is a symmetric matrix. (whiteboard/exercise)

[Figure: composing branches (Q, t) and (Q, t') into (Q, t + t') - revisited for reversible processes.]
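To close, a minimal numerical sketch of the last claim ($\pi$ and $S$ below are arbitrary choices, not data): build $q_{ij} = \pi_j s_{ij}$ from a symmetric S, then verify detailed balance and that $\pi$ is stationary.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# A target stationary distribution pi and a symmetric matrix S.
pi = np.array([0.1, 0.2, 0.3, 0.4])
S = rng.random((4, 4))
S = S + S.T  # symmetrize

# Build rates q_ij = pi_j * s_ij off the diagonal; diagonal = -row sum.
Q = S * pi[None, :]
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=1))

# Detailed balance: pi_i q_ij == pi_j q_ji for all i, j.
assert np.allclose(pi[:, None] * Q, (pi[:, None] * Q).T)

# pi is stationary: pi Q = 0, hence pi exp(Qt) = pi for any t.
assert np.allclose(pi @ Q, 0.0)
assert np.allclose(pi @ expm(Q * 7.3), pi)
print("detailed balance and stationarity hold")
```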