CS 5263 Bioinformatics
Reverse-engineering Gene Regulatory Networks

Genes and Proteins
• Gene (DNA) → transcription (also called expression) → mRNA → translation → Protein
• Regulation occurs at every step: transcriptional regulation, translational regulation, and post-translational regulation (protein (de)activation)
• mRNA is also subject to degradation

Gene Regulatory Networks
• The functioning of a cell is controlled by interactions between genes and proteins
• Genetic regulatory network: genes, proteins, and their mutual regulatory interactions
• [Diagram: three genes whose protein products act as repressors or activators of one another]

Reverse-engineering GRNs
• GRNs are large, complex, and dynamic
• Goal: reconstruct the network from observed gene expression behaviors
  – Experimental methods focus on a few genes only
  – Computer-assisted analysis: large scale
• Studied since the 1960s
  – Mostly theoretical work
• Attracting much attention since the advent of microarray technology
• Emerging large-scale assay techniques (ChIP-chip, ChIP-seq, etc.) are making it even more feasible

Problem Statement
• Assumption: the expression value of a gene depends on the expression values of a set of other genes
• Given: a set of gene expression values under different conditions
• Goal: a function for each gene that predicts its expression value from the expression of other genes
  – Probabilistic: Bayesian network
  – Boolean functions: Boolean network
  – Linear functions: linear model
  – Other possibilities, such as decision trees and SVMs

Characteristics
• Gene expression data are often noisy, with missing values
• Only the mRNA level is measured
  – Many genes are regulated not only at the transcriptional level
• # genes >> # experiments: an underdetermined problem!
• Correlation ≠ causality
• Good news: the network structure is sparse (scale-free)

Methods for GRN inference
• Directed and undirected graphs
  – E.g., KEGG, EcoCyc
• Boolean networks
  – Kauffman (1969), Liang et al. (1999), Shmulevich et al. (2002), Lähdesmäki et al. (2003)
• Bayesian networks
  – Friedman et al. (2000), Murphy and Mian (1999), Hartemink et al. (2002)
• Linear/non-linear regression models
  – D'Haeseleer et al. (1999), Yeung et al. (2002)
• Differential equations
  – Chen, He & Church (1999)
• Neural networks
  – Weaver, Workman and Stormo (1999)

Boolean Networks
• Genes are either on or off (expressed or not expressed)
• The state of gene Xi at time t is a Boolean function of the states of some other genes at time t−1
• Example: X' = Y AND (NOT Z); Y' = X; Z' = Y

  X Y Z | X' Y' Z'
  ------+---------
  0 0 0 | 0  0  0
  0 0 1 | 0  0  0
  0 1 0 | 1  0  1
  0 1 1 | 0  0  1
  1 0 0 | 0  1  0
  1 0 1 | 0  1  0
  1 1 0 | 1  1  1
  1 1 1 | 0  1  1

Learning Boolean Networks for Gene Expression
• Assumptions:
  – Deterministic (the wiring does not change)
  – Synchronized updates
  – All Boolean functions are possible
• Data needed: 2^N samples for N genes (in comparison, N samples are needed for linear models)
• General technique: limit the number of inputs per gene to k; the data required drops to about 2^k log(N)

Learning Boolean Networks
• Consistency problem
  – Given: examples S = {<In, Out>}, where In ∈ {0,1}^k and Out ∈ {0,1}
  – Goal: learn a Boolean function f such that f(In) = Out for every <In, Out> ∈ S
  – Note:
    • Given the same input, the output must be unique
    • For k input variables, there are at most 2^k distinct input configurations
  – Example: <001,1> <101,1> <110,1> <010,0> <011,0> <101,0>, or with inputs written as decimal indices: (1,1) (5,1) (6,1) (2,0) (3,0) (5,0)
• Fill a truth table with one entry per input configuration:

  index: 0 1 2 3 4 5 6 7
  value: ? 1 0 0 ? * 1 ?

• A question mark (?) is an undetermined entry (that input was never observed); a star (*) marks a clash, here at index 5, where <101,1> and <101,0> demand different outputs
• No clash → a consistent function exists
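A minimal Python sketch of this consistency check (the function name and data layout are illustrative, not from the slides): fill a 2^k lookup table from the examples and watch for clashes.

```python
# Sketch of the Boolean-network consistency check described above.

def check_consistency(examples, k):
    """examples: list of (input_bits, output) pairs, e.g. [("001", 1), ...].
    Returns (consistent, table); each table entry is 0/1 if determined,
    None if that input was never observed ('?'), or "*" on a clash."""
    table = [None] * (2 ** k)      # '?' everywhere to start
    consistent = True
    for bits, out in examples:
        idx = int(bits, 2)         # input configuration as a decimal index
        if table[idx] is None:
            table[idx] = out       # first observation fixes the entry
        elif table[idx] != out:
            table[idx] = "*"       # same input, different outputs: clash
            consistent = False
    return consistent, table

# The example from the slides: index 5 is observed with both outputs.
S = [("001", 1), ("101", 1), ("110", 1), ("010", 0), ("011", 0), ("101", 0)]
ok, table = check_consistency(S, k=3)
print(ok)      # False: there is a clash at index 5
print(table)   # [None, 1, 0, 0, None, '*', 1, None], i.e., ? 1 0 0 ? * 1 ?
```

Checking one candidate input set this way touches each of the M examples once, which is where the O(Mk) factor below comes from.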
• Undetermined entries (?) may be filled arbitrarily, so a consistent function is generally not unique
• Checking one candidate input set against the data takes O(Mk) time, where M is the number of experiments
• With N genes, choosing k inputs out of N gives a total cost of N · C(N, k) · O(Mk)
• Best-fit problem: find a function f with the minimum number of errors
• Limited error-size problem: find all functions whose error size is at most a given maximum
• Lähdesmäki et al., Machine Learning 2003;52:147-167

State space and attractor basins
• Under synchronous updates, every trajectory eventually enters a repeating cycle of states (an attractor); the states that flow into an attractor form its basin
• What are some biological interpretations of basins and attractors?

Linear Models
• The expression level of a gene at time t depends linearly on the expression levels of some genes at time t−1
• [Diagram: genes X1–X3 at time t−1 connected to X1–X3 at time t by weights Wij, e.g., W11, W21, W31, W32, W33]
• Basic model: Xi(t) = Σj Wij Xj(t−1)
• Continuous version: Xi′(t) = Σj Aij Xj(t), where Xi(t) can be measured and the rate of change Xi′(t) can be estimated from Xi(t)
• In matrix form: X′ = A·X, where X′ and X are N×M (N genes, M time points) and A is N×N

Linear Models (cont'd)
• X′ = A·X
• A: connectivity matrix; Aij describes the type and strength of the influence of the j-th gene on the i-th gene
• To solve for A, we must solve MN linear equations in N² unknowns
• In general N² >> MN, so the system is underdetermined => infinitely many solutions

Get Around The Curse of Dimensionality
• Non-linear interpolation to increase the number of time points
• Cluster genes to reduce the number of genes
• Singular value decomposition (SVD)
  – General solution: A = A0 + C·Vᵀ, where C and Vᵀ are N×N and cij can be nonzero only in the columns j corresponding to zero singular values (at least N − M of them)
  – Take A0 as a solution; it is guaranteed to have the smallest sum of squares
• Robust regression
  – Minimize the number of edges in the network
  – Biological networks are sparse (scale-free)

Robust Regression
• A = A0 + C·Vᵀ
• Minimize the number of non-zero entries in A by choosing C
  – Set A = 0; then C·Vᵀ = −A0; solve for C
  – Over-determined: N² equations, but only the entries of C in the null-space columns, about N(N − M) of them, are free
• Robust regression
  – Fit a hyperplane to a set of points by passing through as many points as possible
  – [Figure: a 2-D example; the fitted line passes through most points and ignores the outliers]

Simulation Experiments
• [Figures: networks recovered by SVD + robust regression vs. by SVD alone]
• Yeung et al., PNAS 2002;99:6163-8

Simulation Experiments (cont'd)
• Works for linear systems and for nonlinear systems close to steady state
• Does not work for nonlinear systems not close to steady state
• The scale-free property does not hold on small networks

Bayesian Networks
• A DAG G(V, E), where
  – Vertex: a random variable
  – Edge: a conditional distribution for a variable, given its parents in G
• [Example DAG: X1 → X3, X2 → X3, X2 → X4, X3 → X5]
• Markov assumption: ∀i, I(Xi; NonDescendants(Xi) | PaG(Xi)), e.g., I(X4; X1, X3, X5 | X2) and I(X5; X1, X2, X4 | X3)
• Chain rule: P(X1, X2, …, Xn) = Π(i = 1..n) P(Xi | PaG(Xi)), e.g.,
  P(X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3 | X1, X2) P(X4 | X2) P(X5 | X3)
• Learning: argmax_G P(G | D) = P(D | G) · P(G) / C

Bayesian Networks (Cont'd)
• Equivalence classes of Bayesian networks
  – Same topology, different edge directions
  – Cannot be distinguished from observation
  – E.g., A → C → B, A ← C ← B, and A ← C → B all encode I(A; B | C); the class is summarized by a PDAG (partially directed acyclic graph)
• Causality
  – A Bayesian network does not directly imply causality
  – Causality can be inferred from observation under certain assumptions:
    • No hidden common cause, e.g., A → B cannot be distinguished from a hidden variable C with A ← C → B
    • ……
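To make the chain rule concrete, here is a small Python sketch for the five-variable example network above; the probability tables are made-up numbers, used only to show how the factorization cuts the number of parameters.

```python
# Sketch: chain-rule factorization for the example DAG
#   X1 -> X3 <- X2,  X2 -> X4,  X3 -> X5
# so that P(x1,...,x5) = P(x1) P(x2) P(x3|x1,x2) P(x4|x2) P(x5|x3).
# All probability tables below are invented, for illustration only.
from itertools import product

p1 = {0: 0.7, 1: 0.3}                                      # P(X1)
p2 = {0: 0.6, 1: 0.4}                                      # P(X2)
p3 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
      (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}  # P(X3 | X1, X2)
p4 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}            # P(X4 | X2)
p5 = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}          # P(X5 | X3)

def joint(x1, x2, x3, x4, x5):
    """P(x1,...,x5) as the product of the local conditionals."""
    return (p1[x1] * p2[x2] * p3[(x1, x2)][x3]
            * p4[x2][x4] * p5[x3][x5])

# The factorization needs 1 + 1 + 4 + 2 + 2 = 10 free parameters,
# instead of 2^5 - 1 = 31 for an unstructured joint distribution.
print(joint(1, 0, 1, 0, 1))   # 0.3 * 0.6 * 0.6 * 0.8 * 0.8

# Sanity check: the joint sums to 1 over all 32 configurations.
print(sum(joint(*xs) for xs in product((0, 1), repeat=5)))
```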
Bayesian Networks for Gene Expression
• Variables are genes (e.g., Gene A, …, Gene E); each local distribution, e.g., P(D | E), can be multinomial or linear
• Other variables can be added, such as promoter sequences, experimental conditions, and time
• Deals with noisy data well; reflects the stochastic nature of gene expression
• Gives an indication of causality
• Practical issues:
  – Learning is NP-hard
  – Over-fitting
  – Equivalence classes of graphs
• Solutions:
  – Heuristic search, sparse candidate algorithm
  – Model averaging
  – Learning partial models

Learning Bayesian Nets
• Find G to maximize Score(G | D), where
  – Score(G | D) = Σi Score(Xi, PaG(Xi) | D)
• Hill-climbing
  – Edge addition, edge removal, edge reversal
• Divide-and-conquer
  – Solve for sub-graphs
• Sparse candidate algorithm
  – Limit the number of candidate parents for each variable (biological implication: the graph is sparse)
  – Iteratively modify the candidate sets

Partial Models (Features)
• Model averaging
  – Learn many models; common sub-graphs are more likely to be true
  – Confidence measure: the number of times a sub-graph appears
  – Method: bootstrap
• Markov relations
  – A is in B's Markov blanket iff A → B, B → A, or A and B are parents of a common child => A and B take part in some joint biological interaction
• Order relations
  – A → … → B (A is an ancestor of B) => A is a cause of B

Experimental Results
• Real biological data set: yeast cell cycle data
• 800 genes, 76 experiments, 200-fold bootstrap
• Tests for significance and robustness
  – More high-scoring features in real data than in randomized data
  – Order relations are more robust than Markov relations with respect to the choice of local probability models
• [Figure: high-confidence Markov relations recovered from the data]
• Friedman et al., J Comput Biol. 2000;7:601-20

Transcriptional regulatory network
• TFs bind promoters to regulate genes
• Who regulates whom? When? Where? How?
• [Diagram: two TFs A and B combine differently at four promoters, implementing g1 = A and not B, g2 = A and B, g3 = A or B, g4 = not (A and B)]
• PNAS 2003;100(9):5136-41

Data-driven vs. model-driven methods
• Data-driven: expression matrix (genes × conditions) → clustering → motif finding (MF) → post-processing → biological insights; descriptive
• Model-driven: expression matrix → learning a model; explanatory and predictive
• A model is "a description of a process that could have generated the observed data"

Data-driven approaches
• Cluster genes by their expression across experiments: hierarchical, K-means, …
• Find motifs in each cluster: MEME, Gibbs sampling, AlignACE, …
• Assumption
  – Co-expressed genes are likely co-regulated: not necessarily true
• Limitations:
  – Clustering is subjective
  – Statistically over-represented but non-functional "junk" motifs
  – Hard to find combinatorial motifs

Model-based approaches
• Intuition: find motifs that are not only statistically over-represented but are also associated with the expression patterns
  – E.g., a motif appears in many up-regulated genes but very few other genes => real motif?
• Model: gene expression = f (TF binding motifs, TF activities)
• Goal: find the function that
  – Can explain the observed data and predict future data
  – Captures the true relationships among motifs, TFs, and the expression of genes

Transcription modeling
• e = f (m1, m2, m3, m4)
• [Diagram: promoters of genes g1–g8 annotated with motif occurrences (variables m1–m4) and expression values e; one gene's expression is marked "?" and must be predicted from its motifs alone]
• Assume that gene expression levels under a certain condition are a function of the TF binding motifs on the genes' promoters
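Before the individual approaches, here is what this setup looks like in code: a small scikit-learn sketch (an assumed library choice; the motif matrix, labels, and the new gene's motif row are all invented) that learns e = f(m1, …, m4) as a decision tree, the first model discussed below.

```python
# Sketch of transcription modeling as classification: e = f(m1..m4).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Rows: genes g1..g8. Columns: presence/absence of motifs m1..m4
# on each gene's promoter (toy data).
motifs = np.array([
    [1, 0, 1, 0],   # g1
    [1, 1, 0, 0],   # g2
    [0, 0, 1, 1],   # g3
    [0, 1, 0, 1],   # g4
    [1, 0, 0, 0],   # g5
    [0, 0, 0, 1],   # g6
    [0, 1, 1, 0],   # g7
    [0, 1, 0, 0],   # g8
])
# Expression label of each gene under one condition: 1 = up, 0 = not up.
e = np.array([1, 1, 0, 0, 1, 0, 0, 0])

tree = DecisionTreeClassifier(max_depth=2).fit(motifs, e)
print(export_text(tree, feature_names=["m1", "m2", "m3", "m4"]))

# Predict the expression of a new gene (the "?" row) from its motifs alone.
print(tree.predict([[1, 0, 1, 1]]))
```

With this toy data a single test on m1 already separates the two classes, which mirrors how the learned trees below tend to stay small.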
Different modeling approaches
• Many different models, each with its own limitations
• Classification models
  – Decision tree, support vector machine (SVM), naïve Bayes, …
• Regression models
  – Linear regression, regression tree, …
• Probabilistic models
  – Bayesian networks, probabilistic Boolean networks, …

Decision tree
• e = f (m1, m2, m3, m4)
• [Figure: the motif matrix for genes g1–g8 and a tree that splits on m1, then m4 and m2; each leaf (A–D) holds a group of genes, e.g., {g7, g8}, {g1, g2, g5}, {g4}, {g3, g6}]
• The tree structure is learned from the data
  – Only relevant variables (motifs) are used
  – Many trees are possible; the smallest one is preferred
• Advantages:
  – Easy to interpret
  – Can represent complex logical relationships

A real example: transcriptional regulation of yeast stress response
• 52 genes up-regulated in heat-shock (positive)
• 156 random unresponsive genes (negative)
• 356 known motifs
• [Figure: the learned tree splits on RRPE, FHL1, RAP1, and PAC; e.g., genes with both RRPE and PAC are almost all positive (23(+)), while genes with none of the motifs are almost all negative (151(−), 10(+))]
• Small tree: only 4 motifs are used
• All 4 motifs are well known to be stress-related
• The RRPE-PAC combination is well known

Application to yeast cell-cycle genes
• [Figures: the network inferred by our method vs. the model network in Science, 2002;298(5594):799-804]
• Ruan et al., BMC Genomics, 2009

Regression tree
• e = f (m1, m2, m3, m4)
• Similar to a decision tree
• Difference: each terminal node predicts a range of real values instead of a class label
• [Figure: the same splits on m1, m4, and m2, but the leaves predict value ranges such as 0 < e < 2 or e ≥ 2]

Multivariate regression tree
• Multivariate labels: use multiple experiments simultaneously
• Use motifs to classify genes into co-expressed groups
• Does not need clustering in advance
• [Figure: a tree on motifs m1, m4, and m2 that partitions genes g1–g8 into groups with similar expression profiles over experiments e1–e5]
• Phuong, T., et al., Bioinformatics, 2004

Modeling with TF activities
• Gene expression = f (binding motifs, TF activities)
• g = f (tf1, tf2, tf3, tf4)
• Rotate the expression matrix: each experiment (e1–e5) becomes one example, with the expression levels of TFs tf1–tf4 as features and the expression of gene g as the label
• [Figure: a one-split tree: tf1 ≤ 0 → g ≤ 0; tf1 > 0 → g > 0]
• Soinov et al., Genome Biol, 2003
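A toy version of the Soinov-style setup just described, again using scikit-learn with invented numbers: after rotating the matrix, each experiment is one example, the TF expression levels are the features, and the class is whether gene g is up.

```python
# Sketch of "modeling with TF activities": predict a gene's state in each
# experiment from TF expression levels in that same experiment (toy data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# After "rotating" the expression matrix: one row per experiment (e1..e5),
# columns are the expression levels of tf1..tf4 in that experiment.
tf_levels = np.array([
    [ 1.2, -0.3,  0.5,  0.1],   # e1
    [-0.8,  0.4, -0.2,  0.9],   # e2
    [ 0.6,  0.1,  0.3, -0.5],   # e3
    [-1.1, -0.7,  0.8,  0.2],   # e4
    [ 0.9,  0.5, -0.4, -0.1],   # e5
])
# Class label per experiment: is gene g up (g > 0) or not (g <= 0)?
g_state = np.array([1, 0, 1, 0, 1])

clf = DecisionTreeClassifier(max_depth=1).fit(tf_levels, g_state)
# With this toy data the single split recovers the rule from the slide:
# tf1 <= 0  ->  g <= 0,   tf1 > 0  ->  g > 0.
print(clf.tree_.feature[0], clf.tree_.threshold[0])
```

Note the contrast with the previous sketch: there, genes were the examples and motifs the features (promoter to expression); here, experiments are the examples and TF levels the features (expression to expression).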
A Decision Tree Model
• [Figure: a decision tree model of gene expression over a genes × experiments matrix]
• Segal et al., Nat Genet. 2003;34(2):166-76

Algorithm BDTree
• Gene expression = f (binding motifs, TF activities)
• Ruan & Zhang, Bioinformatics 2006
• Basic idea:
  – Iteratively partition an expression matrix by splitting genes or experiments
  – Genes are split according to motif scores
  – Conditions are split according to TF expression levels
  – The algorithm decides the best motifs or TFs to use
  – (a rough sketch of a single split step appears at the end of this section)

Transcriptional regulation of yeast stress response
• 173 experiments under ~20 stress conditions
• 1411 differentially expressed genes
• ~1200 putative binding motifs
  – A combination of ChIP-chip data, PWMs, and over-represented k-mers (k = 5, 6, 7)
• 466 TFs
• [Figure: the genes × experiments matrix partitioned into blocks, e.g.:]
  – Genes with motif FHL1 but no RRPE are down-regulated when Ppt1 is down-regulated and Yfl052w is up-regulated
  – Genes with motifs RRPE & PAC are down-regulated when TFs Tpk1 & Kin82 are up-regulated
  – ……

Biological validation
• Most motifs and TFs selected by the tree are well known to be stress-related
  – E.g., motifs RRPE, PAC, and FHL1; TFs Tpk1 and Ppt1
• 42 / 50 blocks are significantly enriched with some Gene Ontology (GO) functional terms
• 45 / 50 blocks are significantly enriched with some experimental conditions
• Examples:
  – RRPE & PAC: ribosome biogenesis (60/94, p < e-65)
  – RRPE only: ribosome biogenesis (28/99, p < e-18)
  – FHL1: protein biosynthesis (98/105, p < e-87)
  – STRE (agggg): carbohydrate metabolism (p < e-20)
  – PAC: nitrogen metabolism

Relationship between methods
• [Diagram: an expression matrix with motifs m1–m4 attached to genes g1–g8 and TF levels t1–t4 attached to conditions c1–c5; four modeling set-ups A–D]
• A, C: from promoter to expression
  – A: single condition
  – C: multiple conditions
• B, D: from expression to expression
  – B: single gene
  – D: multiple genes
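Finally, as promised above, a rough numpy sketch of a single BDTree-style split step. This illustrates the bi-directional partitioning idea (set-ups C and D), not the published algorithm; all names, the toy data, and the scoring criterion (variance reduction) are my own simplifications.

```python
# Rough sketch of one BDTree-style split step: choose between splitting
# genes by a motif and splitting conditions by a TF expression threshold,
# scoring each candidate by the reduction in within-block variance.
import numpy as np

def sse(block):
    """Sum of squared deviations from the block mean (0 for empty blocks)."""
    return ((block - block.mean()) ** 2).sum() if block.size else 0.0

def best_split(expr, motifs, tf_expr):
    """expr: genes x conditions matrix; motifs: genes x motifs 0/1 matrix;
    tf_expr: TFs x conditions matrix. Returns the single split (gene-side
    or condition-side) that most reduces within-block variance."""
    base = sse(expr)
    best = ("none", None, 0.0)
    # Gene-side splits: genes with motif j vs. genes without it.
    for j in range(motifs.shape[1]):
        has = motifs[:, j] == 1
        gain = base - sse(expr[has]) - sse(expr[~has])
        if gain > best[2]:
            best = ("motif", j, gain)
    # Condition-side splits: conditions where TF i is up vs. down.
    for i in range(tf_expr.shape[0]):
        up = tf_expr[i] > 0
        gain = base - sse(expr[:, up]) - sse(expr[:, ~up])
        if gain > best[2]:
            best = ("tf", i, gain)
    return best  # applied recursively, this partitions the matrix into blocks

# Toy data: 6 genes x 4 conditions, 3 motifs, 2 TFs.
rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 4))
motifs = rng.integers(0, 2, size=(6, 3))
tf_expr = rng.normal(size=(2, 4))
print(best_split(expr, motifs, tf_expr))
```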