A hierarchical regression mixture model for inferring gene regulatory networks
Mayetri Gupta (gupta@bios.unc.edu)
University of North Carolina at Chapel Hill

Upstream regulation ↔ Downstream expression
- Fundamental question: how can we understand the biological mechanisms leading to disease?
- Upstream sequences (motif instances in upper case):
    ...gtggtTAGAATagcgactgttttt...    gene 1
    ...taggTATAATacagtctgacaaaa...    gene 2
    ...cagcaacattgaTATAATtgccat...    gene 3
    ...ctaaaacaatTATTATttatcagg...    gene 4
  [Figure: sequence logo of the motif TATAAT; positions 1 to 6, information content 0 to 2 bits]
- Co-regulated genes share similar upstream patterns
- Identify genes that are differentially expressed under different treatments or conditions

Gene Regulation: DNA Motifs
- Proteins bind to DNA to activate gene transcription
  [Figure: sequence logo of the motif TATAAT; positions 1 to 6, information content 0 to 2 bits]
- Position specific weight matrix (PSWM) or Motif

Gene regulation in complex genomes
- Harder problem: many transcription factors working in co-ordination
- LARGE sequence search space: using sequence data only → many false positives?

Upstream regulation ↔ Downstream expression
- Gene expression contains information about sequence motifs
- Sequence may contain information on gene co-regulation
- Expression clustering → Motif discovery, or Expression clustering ↔ Motif discovery?
- What if initial clustering is inaccurate?

Cell-cycle data set (Spellman, Mol Biol Cell. 1998)
  [Figure: gene expression against time (minutes; ticks at 28, 63, 98), with cell-cycle phases G1, S, S/G2, G2/M, M/G1 marked; color: time when gene is active]
- Measurements over 18 time points, 3 different experiments
- ∼ 800 genes known to be cell-cycle dependent
- Do clusters of genes share common TFs?
- Do certain TFs work combinatorially on groups of genes?

Using Gene Expression in Motif Discovery
- REDUCER (Bussemaker, Nat. Genet. 2001): correlates expression of gene with number of motif occurrences
- MDScan (Liu, Nat Biotech.
2002): most strongly differentially expressed genes → candidate motifs
- Motif Regressor (Conlon, PNAS 2003): multiple regression model in which the sum of motif effects explains gene expression
  \[ Y_g = \alpha + \sum_{m=1}^{M} \beta_m S_{mg} + \varepsilon_g \]
  $Y_g$: expression of gene $g$; $S_{mg}$: motif-match score

Using Gene Expression in Motif Discovery
- Non-parametric approaches
  - Phuong et al. (Bioinformatics, 2004): Classification Trees (CART); arbitrary decision criterion based on the number of occurrences of a motif type
  - Multivariate adaptive regression splines (Das et al., PNAS 2004)
- Joint sequence-expression model without parametric connection
  - Holmes and Bruno (ISMB proc., 2000): joint likelihood for sequence and expression data

Using motif information in gene clustering
- Infer sets of transcription factors involved in regulating groups of genes
- Higher transcriptional activity → greater presence of TF binding sites, more pronounced expression changes
- Genes within a "cluster" may be correlated, with or without sharing common transcription factors
- Measurements on the same gene in different conditions may be correlated due to sharing the same upstream transcription factor binding sites

Linear mixed effects model
  \[ y = X\beta + Zb + \varepsilon, \qquad \varepsilon \sim N(0, \tau_0^{-1} I) \]
  $\beta$: fixed effects; $b$: random effects; $y$: gene expression
- Fixed effects: sequence motif. Levels of factor are reproduced exactly if experiment is repeated
- Random effects: expression cluster. Levels of factor (expression + clusters) may not be reproduced exactly if experiment is repeated

Joint Model for Sequence-Expression
- Complication: gene cluster identity cannot be assumed known; the Z matrix is not "fixed"
- Conditional on cluster $k$ ($k = 1, \dots, K$), the vector of log-expression values of gene $g$ is generated from a mixed-effects model:
  \[ Y_g \mid z_g = k, X, \text{parameters} \sim N\!\left(\xi_g + X_g^T \beta_k \mathbf{1},\; \sigma_k^2 I\right) \quad (\equiv f_k) \]
- Unconditionally, a mixture model:
  \[ P(Y \mid X, \text{parameters}) = \prod_{\text{genes } g} \Big[ \sum_{\text{cluster } k} \pi_k f_k(Y_g \mid \text{parameters}) \Big] \]
- Which $\beta$'s are significant, in which cluster?

Bayesian hierarchical formulation
- Prior distributions for cluster $k$ parameters:
  - "Expression" model:
    \[ \mu_k \sim N(\cdot, v_{k0}^2 I), \qquad \sigma_k^2 \sim \mathrm{InvGamma}(\cdot, \cdot), \qquad \xi_g \mid z_g = k \sim N(\mu_k, \tau_0 \sigma_k^2 I) \]
  - "Sequence" model:
    \[ \beta_k \sim N(\beta_0, V_k) \]
- Probabilities of cluster membership: $P(z_g = k) = \pi_k$, with $(\pi_1, \dots, \pi_K) \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)$
- $G$ genes; $T$ measurements per gene

Bayesian hierarchical formulation
- For each $\beta_k$, we use a multivariate extension of the g-prior (Zellner, 1986), so that
  \[ V_k = \frac{c\,\sigma_k^2}{T}\, S_k^{-1}, \qquad \text{where } S_k = \sum_{z_g = k} X_g X_g^T \]
- Why use the g-prior?
  - Computational efficiency; varying $c$ → more/less informative
  - Induces dependence among genes in a cluster due to sequence effects:
    \[ \mathrm{Cov}(Y_g, Y_{g'} \mid z_g = z_{g'} = k) = v_{k0}^2 I + \frac{c\,\sigma_k^2}{T}\, \mathbf{1} X_g^T S_k^{-1} X_{g'} \mathbf{1}^T \]

Regulatory Motif Model
- Every non-site position is multinomial with $\theta_0 = (\theta_{01}, \dots, \theta_{04})$
    $\cdots\ \theta_0\ \theta_0\ \theta_1 \cdots \theta_6\ \theta_0\ \theta_0\ \cdots$
- Every motif position $i$ is multinomial with $\theta_i = (\theta_{i1}, \dots, \theta_{i4})$
- Product Multinomial model
- Challenge: find position of sites and $\theta$'s
- MEME (Bailey, ISMB 1994), Gibbs Motif Sampler (Liu, JASA 1995), AlignAce (Roth, Nat. BioTech 1998), BioProspector (Liu, Pac. Symp. Biocomp.
2001), Stochastic Dictionary (Gupta, JASA 2003)

Sequence Motif Scoring
- Starting set of motifs
  - De novo: different clusters of genes exhibiting "strong" up- or down-regulation (MDScan)
  - Databases: derived from experimental data
- Motif score for $w$-width motif $j$ and upstream sequence $g$:
  \[ X_{gj} = \sum_{\text{positions } i} \frac{P(\mathrm{seq}(i, i+w-1) \mid \text{motif } j)}{P(\mathrm{seq}(i, i+w-1) \mid \text{null})} \]

Sequence motif selection
- Initial set of $D$ motif candidates ($D$ can be large!)
- In the regression model, we want to know which motifs are correlated "significantly" with the response (gene expression)
- $u = (u_1, \dots, u_D)$, where
  \[ u_j = \begin{cases} 1 & \text{if motif } j \text{ is in the model} \\ 0 & \text{otherwise} \end{cases} \]
- Prior probability of motif inclusion:
  \[ P(u) = \prod_{j=1}^{D} \eta^{u_j} (1 - \eta)^{1 - u_j} \]
- Variable selection from a LARGE potential set

Outline of method
- Select motifs from large initial set
- Update clusters given motifs active in each class
- Update model parameters given cluster membership and functional motifs in each class
- Complications
  - Cluster membership $z$ unknown
  - Number of clusters $K$ may be unknown (assume fixed for now)
  - Number of motif candidates is large

Parameter updating using MCMC
- Update from joint posterior distribution $P(\theta, \beta, u, z \mid Y, X, K)$, with $\theta = (\mu, \sigma^2, \pi)$
- For updating steps for parameters $\theta$, $\beta$ and $z$, marginalize over other parameters for efficiency (conjugate forms permit this)
- For updating $u$, use evolutionary Monte Carlo (Liang and Wong, JASA 2001)
- Select motifs that have most effect on expression, and differentiate most among clusters

Three simulated data sets
- 200 "genes" in $K = 2$ clusters
- 2 measurements each
- Regression coefficients for motifs corresponding to 3 PSWMs from the JASPAR database: SAP1, SRF, and MEF2
- (Motif 3 not present in data)
  [Table: regression coefficients $\beta_{M1}$, $\beta_{M2}$, $\beta_{M3}$ for clusters C1 and C2 in Data 1, Data 2 and Data 3; the extracted cell values (2, 0, −2) have lost their column alignment]
  [Figure: scatter plots of Y1 with motif 1, 2, 3 scores (columns) in 3 data sets (rows)]

Simulation study results: Bayes factors
  [Figure: P(M_K) against K = 1, ..., 6, one panel each for Data 1, Data 2, Data 3]
- Optimal choice: $K = 2$
- Marginal model probability for $M_K$ through double mixture importance sampling:
  \[ \hat{P}(Y \mid M_K) = \frac{1}{N_t} \frac{1}{N_s} \sum_{t} \sum_{s} P(Y \mid z^{(t)}, \theta^{(s)}, K)\, \frac{\pi_1(\theta^{(s)} \mid z^{(t)})\, \pi_2(z^{(t)})}{f_1(\theta^{(s)} \mid z^{(t)})\, f_2(z^{(t)})} \]

Simulation: β estimates
  [Figure: estimated coefficients by motif type (1, 2) for clusters c1 and c2, one panel each for Data 1, Data 2, Data 3]
- Data sets 1 and 2: Motifs 1, 2 selected
- Data set 3: Motif 1 selected

Spellman (1998) data
  [Figure: gene expression against time (minutes; ticks at 28, 63, 98), with cell-cycle phases G1, S, S/G2, G2/M, M/G1 marked]
- Pick different groups of genes that are highly differentially expressed
- Sets of motif candidates of widths 7-12 bp (using MDScan)
- Motif overlaps lead to collinearity: remove motifs with correlation > 0.5 with a higher-ranked one → 32 candidate motifs
- Two consecutive time points: 1-2, 3-4, ..., 17-18

Number of clusters K
- $\log(\widehat{BF})$ compared to 1-component model
  [Figure: log(BF) against time interval (1 to 9), points labeled by K = 2 or K = 3]
- Significant motifs over time intervals
  [Figure: sequence logos of the significant motifs for each interval with the selected K∗; logos labeled SFF, MCM1, MCB and SCB, with consensus strings including AACAACA, ACGCGT, GGACAGA, TTTTGTT, AACTAC and TAATTA]

Motif influence at different time intervals
  [Figure: posterior probability of selection (0 to 1) against motif index (0 to 30), one panel per time interval]
- Selected motif types for 9 time intervals for optimal $K$

Significant motifs match experimental PSWMs

  Index  TF name  Consensus              Expt.
  4      MCM1     CGAAGAG/CTCTTCG        CCNNNWWRGG
  6      GCN1     TCAGTCA/TGACTGA        TCAGTCA
  7      [CSRE]   GGACAGA/TCTGTCC        [YCGGAYRRAWGG]
  10     MCB      ACGCGTA/TACGCGT        WCGCGW
  16     SFF      AACAACA/TGTTGTT        GTMAACAA
  18     MCM1     CCAATTAGG/CCTAATTGG    CCNNNWWRGG
  20     [RME1]   TTCAGGTAC/GTACCTGAA    [GAACCTCAA]
  22     SCB      CGCGAAAAA/TTTTTCGCG    CNCGAAA
  25     PHO4     CGTACGTAC/GTACGTACG    CACGTK
  29     -        CTTCGCATC/GATGCGAAG

  Phase column (five entries; row alignment lost): M, G1, M, M, G1
  K ≡ {G or T}; M ≡ {A or C}; N ≡ {A or C or G or T}; R ≡ {A or G}; W ≡ {A or T}; Y ≡ {C or T}

Summary
- Treating gene expression clustering as a variable may help in discovering relationships between functional sequence motifs and the groups of genes they regulate
- Different groups of genes may behave as a cluster at different time points
- Upstream sequence motifs are "constant", but their effects/interactions may vary over time

Further extensions
- Motif scoring issues: sensitivity, co-occurrence of sites
- Efficient model selection
- Extension to high density ChIP tiling arrays
- Acknowledgement: Joseph G. Ibrahim (UNC), Jason Lieb (UNC), UNC high-performance scientific computing group

Model Selection: number of clusters K
- Likelihood-based methods not valid (BIC, etc.)
- Bayes factor: ratio of model marginal probabilities
- Marginal probability for model $M_K$:
  \[ P(Y \mid M_K) = \sum_{z} \int_{\theta} P(Y \mid z, \theta, K)\, p(\theta \mid z)\, p(z \mid K)\, d\theta \]
- Double mixture importance sampling:
  \[ \hat{P}(Y \mid M_K) = \frac{1}{N_t} \frac{1}{N_s} \sum_{t} \sum_{s} P(Y \mid z^{(t)}, \theta^{(s)}, K)\, \frac{\pi_1(\theta^{(s)} \mid z^{(t)})\, \pi_2(z^{(t)})}{f_1(\theta^{(s)} \mid z^{(t)})\, f_2(z^{(t)})} \]
- Challenge: good sampling densities $f(\cdot)$ for $z$ and $\theta$
- For a simpler case ($\theta$ marginalized), Raftery et al. (TR, 2003) propose permutation-based methods to find good candidate sampling densities for $z$
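
A minimal sketch of the double importance-sampling estimator, on a toy model rather than the sequence-expression model of the talk. Everything here is an assumption made for illustration: a single observation y, a label z in {0, 1} with prior pi_2(z), theta | z ~ N(m_z, 1) playing the role of pi_1, y | theta ~ N(theta, 1), and hand-picked sampling densities f_2 (fixed label probabilities) and f_1 (a unit-variance normal centred between y and m_z). The function name double_is_marginal is hypothetical. The toy marginal is available in closed form, which allows a check of the estimate.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x (broadcasts over arrays)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def double_is_marginal(y, prior_z, prior_means, prop_z, n_t=5000, n_s=20, seed=0):
    """Estimate P(y) = sum_z int P(y | theta) pi_1(theta | z) pi_2(z) d theta.

    Toy assumptions (not from the slides): y | theta ~ N(theta, 1),
    theta | z ~ N(prior_means[z], 1), z ~ prior_z.
    """
    rng = np.random.default_rng(seed)
    # Outer draws z^(t) ~ f_2.
    z = rng.choice(len(prior_z), size=n_t, p=prop_z)
    # Inner draws theta^(s) ~ f_1(. | z^(t)), n_s per outer draw.
    prop_mean = 0.5 * (y + prior_means[z])
    theta = rng.normal(prop_mean[:, None], 1.0, size=(n_t, n_s))
    # Likelihood, prior pi_1 and sampling density f_1 at each draw.
    lik = normal_pdf(y, theta, 1.0)
    pi1 = normal_pdf(theta, prior_means[z][:, None], 1.0)
    f1 = normal_pdf(theta, prop_mean[:, None], 1.0)
    # Label weight pi_2(z) / f_2(z), shared by all inner draws of a given t.
    w_z = (prior_z[z] / prop_z[z])[:, None]
    # Average over both t and s: the 1/(N_t N_s) double sum.
    return float(np.mean(lik * pi1 / f1 * w_z))


y_obs = 1.0
prior_z = np.array([0.5, 0.5])
prior_means = np.array([-2.0, 2.0])
prop_z = np.array([0.3, 0.7])

estimate = double_is_marginal(y_obs, prior_z, prior_means, prop_z)
# Closed form for the toy model: y | z ~ N(m_z, 2).
exact = float(np.sum(prior_z * normal_pdf(y_obs, prior_means, 2.0)))
print(estimate, exact)
```

In this sketch the inner draws of theta are made conditionally on each sampled label, so the weights stay bounded because f_1 has heavier tails than the conditional posterior of theta. In the full model, finding f_1 and f_2 with that property is the challenge named above.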