Statistical analysis of transcription regulatory modules
Mayetri Gupta
University of North Carolina at Chapel Hill
Joint work with Jun S. Liu (Harvard University)
gupta@bios.unc.edu -- p.1/24

Upstream regulation ↔ downstream expression
Fundamental question: how can we understand the biological mechanisms leading to disease?
...gtggtTAGAATagcgactgttttt...    gene 1
...taggTATAATacagtctgacaaaa...    gene 2
...cagcaacattgaTATAATtgccat...    gene 3
...ctaaaacaatTATTATttatcagg...    gene 4
|TATAAT|
[Figure: sequence logo of the shared motif; x-axis: positions 1 to 6; y-axis: bits, 0 to 2]
Co-regulated genes share similar motif patterns

Transcription regulation: DNA motifs
Proteins bind to DNA to activate transcription
[Figure: sequence logo of |TATAAT|; x-axis: positions 1 to 6; y-axis: bits, 0 to 2]
Position specific weight matrix (PSWM) or Motif

Transcription regulation in eukaryotes
Harder problem: many transcription factors working in co-ordination
LARGE sequence search space + WEAK motifs: → many false positives/negatives?

Regulatory motif discovery
Every non-site position multinomial with θ0 = (θ01, ..., θ04)
··· θ0 θ0 θ1 ··· θ6 θ0 θ0 ···
Every motif position i multinomial with θi = (θi1, ..., θi4)
Product Multinomial model
Challenge: Find position of sites and θ's
MEME (Bailey, ISMB 1994)
Gibbs Motif Sampler (Liu, JASA 1995), AlignAce (Roth, Nat. BioTech 1998), BioProspector (Liu, Pac. Symp. Biocomp. 2001), Stochastic Dictionary (Gupta, JASA 2003)

Motif ≡ PSWM
Position-Specific Weight Matrix Θ (one row per motif position):
Position   A     C     G     T
1          .2    .0    .0    .8
2          0     0     0     1
3          .0    .1    .8    .1
4          .95   .05   .0    .0
5          0     .99   .01   0
6          .9    .0    .0    .1
TTGACA [PROB = 0.5417]
ATGACT [PROB = 0.0150]
TTTAGT [PROB = 7.6e-05]
Columns assumed independent

Motif discovery
[Figure: sequences with unknown site positions A1, A2, A3, ..., AN = ??]
Motif site location in sequences A = (A1, ...,
AN)
Sample A | Θ, then Θ | A
Variants/extensions: ALIGNACE (Roth 1998), BIOPROSPECTOR (Liu 2001)

Motif → regulatory module
Combinatorial regulation in complex organisms
Spatial "clusters" of motifs: "modules"
Usual motif search methods may
    miss weaker sites
    detect low-complexity repeats
Sequence search space is much larger

Example: combinatorial regulation in CNS development
Experiments determined Sim:Tango (ACGTG) binding sites in Drosophila midline-expressed genes
At least one more regulatory element indicated

A model-selection-based approach
Gibbs sampler on B. subtilis data
[Figure: sites found for motifs Θ1 to Θ15; x-axis: sequence index (1 to 141); y-axis: position of site along sequence, as distance from transcription start site (-123 to 27)]
Determine true module of size K from starting set of {Θ1, ..., Θ15} (K, Θ unknown)

Sequence model if K is known
K PSWMs with Θ = (Θ1, ..., ΘK)
Occurrence probabilities ρ; transition probabilities between PSWM types T: (τ_kl)
Missing data: A = (A_ij); A_ij = j-th site in sequence i
A ∼ P(· | λ)
Then likelihood of sequence S:
\[ P(S \mid \phi) = \sum_{A} \sum_{T} P(S \mid \phi, A, T)\, P(A \mid \phi)\, P(T \mid A, \phi), \qquad \phi = (\rho, \Theta, \tau, \lambda) \]

Hidden Markov model (HMM) for modules
[Figure: HMM state diagram with start state S, motif states M1 and M2, end state E, and transition probabilities τ_S1, τ_S2, τ_11, τ_12, τ_21, τ_22, τ_1E, τ_2E]
Sequence model is a hidden Markov model (HMM) with A following a Markov chain
"Clustering": site probability f decreases with distance from nearest site; P(A_ij − A_i,j−1) is assumed geometric(λ)

Parameter estimation
Problems are two-fold:
    K is unknown; we only have D ≥ K "approximate" starting PSWMs
    Estimate φ = (Θ, ρ, λ, τ): dimensionality changes with K
Define (u1, ..., uD); uj = 1 if motif j is in the module
With K known, Gibbs sampler (e.g. Thompson et al., Genome Res.
2004; Zhou and Wong, PNAS 2004)

Prior model (K unknown)
Dirichlet, Product Dirichlet, Beta prior distributions on parameters φ = (ρ, Θ, τ, λ)
u ∼ Bernoulli(·)
Posterior distribution of all parameters:
\[ P(\phi, u, A \mid S) \propto P(S \mid u, \phi, A)\, p(\phi, A \mid u)\, p(u) \]

Discovery of regulatory modules
CHALLENGES:
(A) Total number of potential motifs D may be LARGE!
(B) Naive posterior computation: exponential order in sequence length
[Diagram: iterative update cycle]
    From total of D motifs, choose d for module
    Update motif sites within modules: given Θ, ρ, τ, λ, sample A
    Given d and A, update ρ, τ, λ
    From aligned sites, update parameters Θ

(A) Evolutionary Monte Carlo to update u
Liang and Wong, Stat. Sinica 2000
Update "Population" using 2 kinds of "moves"
Motif selection: choose d out of {Θ1, Θ2, ..., ΘD} for module
[Figure: starting set of 4 possible module types ("units") u1 to u4, each a binary vector; mutation and crossover moves between units lead to a new configuration]
Metropolis-Hastings-like moves with target distribution:
\[ \int P(u \mid S, A, \phi)\, p(\phi)\, d\phi \]

(B) Data augmentation: forward-backward algorithm
Taking advantage of the HMM formulation, recursively define a "forward" probability of occurrence of motif type k at position i:
\[ F_{i,k} = \sum_{j<i} \sum_{l=1}^{K} F_{j,l}\, p_\lambda(i-j)\, \tau_{lk}\, P(\text{motif } k \text{ at } i), \qquad \sum_{k} F_{N,k} = P(A \mid S, \phi, u) \]
[Figure: forward recursion; F_ik for motif Θ_k at position i collects F_j1, F_j2, F_j3, ... for motif Θ_l at earlier positions j, each weighted by τ_lk P_λ(i − j)]
Use "backward" sampling to update A_N ∼ P(A_N | S, φ), A_{N−1} ∼ P(A_{N−1} | A_N, S, φ), ···
Given A, u, parameter updating is straightforward (conjugate priors)

Skeletal muscle sequences (Wasserman et
al., 2000)
24 aligned sequence pairs from human and mouse genomes involved in skeletal muscle gene regulation
[Figure: three panels (sequence 1, sequence 2, sequence 3); x-axis: position along sequence; y-axis: probability, 0.0 to 1.0]
Posterior probability of site occurrence in human regulatory sequences (first three in set)
Red (dotted) line: MEF2 motif
Black (solid) line: MYF motif

Skeletal muscle sequences (Wasserman et al., 2000)
Matches to experimentally derived sites using 5 methods
Method         MEF  MYF  SP1  SRF  Total  SENS  SPEC  MSpec  Time
MEME (EM)        0    1   21    0    161  0.14  0.14   0.20  3927.6
BioProspector    6    1    8    1    155  0.10  0.10   0.36  481.0
CisModule       17    0    6    0    222  0.15  0.10   0.40  2160.0
GMS              6    6    2    1     84  0.10  0.25   0.44  68258.75
GMS p∗          14   14    4    6    162  0.25  0.23   0.60  131112.9
EMCModule       12   12    5    7    180  0.23  0.20   0.67  21943.0
EMCModuleJ      17   13    8   10    108  0.31  0.44   0.80  8536.2
True            32   50   44   28    154
EMCModuleJ: with JASPAR matrices
SENS: sensitivity
SPEC: specificity
MSpec: (True predicted motif types)/(True motif types)

Drosophila midline sequences
Available data: D. melanogaster and D. pseudoobscura for 30 genes expressed in central nervous system midline cell progenitors.
60 odorant receptor genes (controls)
Experimental determination of sites for breathless, single-minded, Toll, and rhomboid genes

Drosophila midline sequences: analysis
Starting collection of 30 motifs from sequences weighted by alignment scores (CompareProspector)
EMCmodule detects 3 motifs including Sim:Tango (ACGTG)
Mann-Whitney tests for motif scores comparing to the controls are highly significant (p < 1e−5)
Analysis: Emma Huang

Error rates compared for 3 data sets
FP: proportion of "spurious" motifs in final module
FN: proportion of motifs missed
                 B. subtilis    Drosophila     Skeletal muscle
Method           FN    FP       FN    FP       FN    FP
BioProspector    0     9/11     0.4   88/96    0.4   67/81
MEME             0     45/50    0.2   44/50    0.4   46/50
AlignAce         0.5   4/5      0.6   68/70    0.8   8/8
EMCmodule        0     1/3      0.2   7/16     0.4   1/5
FP rates may be overestimated for motifs of currently unknown function

Summary and further improvements
No prior information necessary for motifs, inter-site distances
Decreases error rates: motif selection and motif updating
Can combine previous knowledge (e.g. use TRANSFAC, JASPAR) and also reveal new information
Improving inter-motif distance model (long-range dependence?)
Comparative genomics?
Local chromatin structure
Chromatin immunoprecipitation tiling array data
Gupta, M. and Liu, J. S. (2005). De-novo cis-regulatory module elicitation for eukaryotic genomes. PNAS 102(20):7079-84.

Acknowledgements
Data: W. Thompson (Wadsworth Lab), S. Crews (UNC), ENCODE project, J. Lieb (UNC), R. Losick (Harvard)
Emma Huang (UNC): analysis of Midline genes data
For reprints, software, questions, contact gupta@bios.unc.edu
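
Worked example for the "Motif ≡ PSWM" slide. The minimal Python sketch below multiplies the per-position probabilities of Θ under the slide's column-independence assumption. The matrix values are copied from the slide, read as one row per motif position with columns A, C, G, T. That orientation is an assumption, chosen because it reproduces the three probabilities quoted on the slide. The names THETA and site_probability are illustrative and are not taken from the authors' software.

```python
# Rows are motif positions 1-6; columns are nucleotides A, C, G, T.
# Values copied from the "Motif = PSWM" slide (orientation is an assumption).
NUCLEOTIDES = "ACGT"
THETA = [
    [0.20, 0.00, 0.00, 0.80],
    [0.00, 0.00, 0.00, 1.00],
    [0.00, 0.10, 0.80, 0.10],
    [0.95, 0.05, 0.00, 0.00],
    [0.00, 0.99, 0.01, 0.00],
    [0.90, 0.00, 0.00, 0.10],
]


def site_probability(site, theta=THETA):
    """Probability of `site` under the PSWM, treating positions as independent."""
    if len(site) != len(theta):
        raise ValueError("site length must equal the motif width")
    prob = 1.0
    for row, base in zip(theta, site.upper()):
        # Look up the probability of this base at this motif position.
        prob *= row[NUCLEOTIDES.index(base)]
    return prob


for candidate in ("TTGACA", "ATGACT", "TTTAGT"):
    print(candidate, f"{site_probability(candidate):.4g}")
```

Under this reading the three example sites evaluate to roughly 0.5417, 0.0150 and 7.6e-05, in agreement with the slide.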
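
Sketch of the forward recursion from the "(B) Data augmentation" slide. The Python code below fills a table F[i][k] following the slide's recursion, with several simplifying assumptions that are not in the slides: all motifs share one width, background (non-site) probabilities are omitted, overlapping sites are not excluded, and the first site in a sequence is weighted by ρ_k times a geometric(λ) distance from a virtual start just before the sequence. The function names and toy inputs are hypothetical; backward sampling of A is not shown.

```python
NUCLEOTIDES = "ACGT"


def geometric_pmf(distance, lam):
    """P(distance = d) for d = 1, 2, ... under a geometric(lam) law."""
    return lam * (1.0 - lam) ** (distance - 1)


def emission(seq, start, pswm):
    """P(motif at start): product of PSWM entries over the site's bases."""
    prob = 1.0
    for offset, row in enumerate(pswm):
        prob *= row[NUCLEOTIDES.index(seq[start + offset])]
    return prob


def forward_table(seq, pswms, rho, tau, lam):
    """F[i][k]: weight of all site configurations whose last site is motif k at i."""
    width = len(pswms[0])              # assumption: equal motif widths
    n_starts = len(seq) - width + 1
    n_motifs = len(pswms)
    F = [[0.0] * n_motifs for _ in range(n_starts)]
    for i in range(n_starts):
        for k in range(n_motifs):
            # Assumed initial term: motif k is the first site in the sequence.
            total = rho[k] * geometric_pmf(i + 1, lam)
            # Slide recursion: sum over earlier positions j and motif types l.
            for j in range(i):
                for l in range(n_motifs):
                    total += F[j][l] * geometric_pmf(i - j, lam) * tau[l][k]
            F[i][k] = total * emission(seq, i, pswms[k])
    return F


# Toy inputs (hypothetical): two width-2 motifs favouring "AC" and "GT".
pswms = [
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]],
    [[0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]],
]
rho = [0.6, 0.4]
tau = [[0.7, 0.3], [0.2, 0.8]]
seq = "ACGTAC"
lam = 0.3
F = forward_table(seq, pswms, rho, tau, lam)
print(F[-1])
```

The double loop over (j, l) makes the cost quadratic in sequence length for a fixed number of motifs, in contrast to the exponential cost of enumerating every site configuration directly.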