Causal Inference & Genetic Regulatory Networks
Peter Spirtes
Carnegie Mellon University
With slides from Lizzie Silver

Outline
• Biology
• Data and Background Knowledge
• Problems
• Algorithms

Causal Graph
[Figure labels: gene, mRNA, protein]
Sources: http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/images/REGNET.jpg

Protein States
• Folding
• Location (nucleus, membrane, etc.)
• Phosphorylation at different sites
• Ubiquitination
• etc.

Levels of Description
• Differential equation models of rates of reaction
• Which transcription factors bind to which sites, and can in turn be prevented from or aided in binding to sites
• Which gene products affect the rate at which other genes produce proteins

Example Graph

Knockout Experiments
• Insert a DNA construct into the cell
• The DNA construct recombines with the target gene
• The target gene then either yields no protein at all or yields a nonfunctional protein

Data: Hughes lab yeast data
Hughes lab data on S. cerevisiae:
• 63 wild-type strains
• 267 gene-deletion mutants
• No information on direct effects, only total effects
• But does include information on non-effects
• Must normalize the data to account for differential manipulation. What is the right normalization method?

Data: M3D
Many Microbes Microarrays Database (M3D)
• Collected from multiple sources of published data: different labs, strains, experimental conditions, genetic manipulations
• E. coli (907), S. oneidensis (245), S. cerevisiae (904)
• Experiment descriptions/features standardized
• Affymetrix arrays uniformly normalized using Robust Multi-array Average (RMA); raw data also available

Data: RegulonDB
RegulonDB: The E. coli regulatory network database
• Curated database of expert knowledge, constantly updated
• List of known TF → gene effects, annotated with valence (+; -; ±), type(s) of supporting evidence, and supporting publications
• No information on size of effects
• Some information about direct effects, from Chromatin ImmunoPrecipitation (ChIP) assays, recognized binding sites, etc.
• If an effect is not in RegulonDB, that doesn't mean it doesn't exist!

Search for Genetic Regulatory Networks
Goal: Search for a Directed Acyclic Graph (DAG) representing the “direct” regulatory effects (relative to the set of genes).
This is a hard problem!
• # of DAGs is super-exponential in # of genes (4,300 genes in E. coli)
• Unobserved confounders
  - Environmental conditions
  - Excluded genes
  - Latent TFs that influence multiple genes
• Cycles
• Unfaithfulness
• Non-linearity, non-Gaussianity
• Density not strictly positive
• Small sample size
• Aggregation
• How can we evaluate performance?

Li and Biggin

Feedback
• The equilibrium state of a feedback system can be represented by a cyclic graph.
• If the joint distribution is Gaussian or multinomial, then the natural extension of d-separation to cyclic graphs entails the corresponding conditional independence.
• If the joint distribution is not Gaussian or multinomial, then the natural extension of d-separation to cyclic graphs does not in general entail the corresponding conditional independence.

Feedback
• There is an extended sense of “pattern” to represent the set of all Markov equivalent graphs, cyclic or acyclic.
• However, it is much more complicated than a pattern: not all Markov equivalent cyclic graphs share the same set of adjacencies, and there are dependencies among the edges.
• There is an algorithm that is an extension of the PC algorithm for searching for cyclic graphs.
• We do not have a graphical representation of the set of all Markov equivalent graphs with cycles and latent variables.
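As a small numerical illustration of the Gaussian case in the Feedback slides, the sketch below takes a linear system with one feedback loop and checks conditional independences that d-separation predicts. The graph (A → X, B → Y, X → Y, Y → X), the coefficients, and the function names are assumptions chosen for illustration, not taken from the slides. The sketch relies on two standard facts: the equilibrium of x = Cx + e with diagonal noise covariance Ω has precision matrix (I − C)^T Ω^{-1} (I − C), and in a Gaussian distribution a zero precision entry means the two variables are independent given all the others.

```python
# Illustrative sketch: equilibrium of a linear cyclic model (assumed example).
# Variable order: A, B, X, Y.  Structural equations at equilibrium:
#   A = eA,  B = eB,  X = a*A + c*Y + eX,  Y = b*B + d*X + eY

def coefficient_matrix(a, b, c, d):
    """coef[i][j] is the coefficient of variable j in the equation for i."""
    return [
        [0.0, 0.0, 0.0, 0.0],  # A is exogenous
        [0.0, 0.0, 0.0, 0.0],  # B is exogenous
        [a,   0.0, 0.0, c],    # X <- A, X <- Y
        [0.0, b,   d,   0.0],  # Y <- B, Y <- X
    ]

def equilibrium_precision(coef, noise_var):
    """Precision matrix K = (I - coef)^T diag(1/noise_var) (I - coef)."""
    n = len(coef)
    m = [[(1.0 if i == j else 0.0) - coef[i][j] for j in range(n)]
         for i in range(n)]
    return [[sum(m[k][i] * m[k][j] / noise_var[k] for k in range(n))
             for j in range(n)] for i in range(n)]

A, B, X, Y = range(4)
noise_var = [1.0, 1.0, 0.5, 2.0]

# Cyclic case: X -> Y and Y -> X (c*d = 0.2, so I - coef is invertible).
k_cyclic = equilibrium_precision(coefficient_matrix(0.8, 0.6, 0.5, 0.4), noise_var)
# Acyclic case: drop the Y -> X edge.
k_acyclic = equilibrium_precision(coefficient_matrix(0.8, 0.6, 0.0, 0.4), noise_var)

# A and B are d-separated given {X, Y} in both graphs.
print("cyclic  K[A][B] =", k_cyclic[A][B])
# A and Y are d-connected given {B, X} only when Y -> X is present
# (X is then a conditioned collider on the path A -> X <- Y).
print("cyclic  K[A][Y] =", k_cyclic[A][Y])
print("acyclic K[A][Y] =", k_acyclic[A][Y])
```

In this toy system the precision entry for (A, B) vanishes with or without the feedback edge, while the entry for (A, Y) is nonzero only when the loop is present, which matches the d-separation reading of the cyclic graph.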
Local Markov Theorem (Chu)
Let G be an acyclic graph representing the causal relations among a set V of random variables. Let $Y, X_1, \ldots, X_k \in V$, and let $X = \{X_1, \ldots, X_k\}$ be the set of parents of Y in G. If $Y = c^{T} X + \varepsilon$, where $c^{T} = (c_1, \ldots, c_k)$ and $\varepsilon$ is a noise term independent of all non-descendants of Y, then Y is independent of all its non-parents, non-descendants conditional on its parents X, and this relation holds under aggregation.

Direct versus Indirect Effects
What is a manipulation?
• Setting the value of a variable means you break the influence of whatever normally influences it.
• e.g. Setting arcA := 0
Direct v. total effects:
• Total effects: If I manipulate fnr and let everything else vary as usual, what happens to sodA?
• Direct effects: If I manipulate fnr and also clamp arcA to its current value, what happens to sodA? (What if I don't know whether to clamp narK?)

Gold Standard: Direct v. Total Effects
Direct effects:
• For known true graph: evaluate using graph similarity metrics
  - True positive and false positive rate for adjacencies and orientations
  - Structural Hamming distance
• For unknown true graph:
  - Experimental control of all potential back-door paths
  - Mechanistic approach: protein binding arrays
Total effects:
• For known true graph:
  - Path from cause to effect in both graphs? (Path length? Size of path coefficients?)
  - Structural intervention distance
• For unknown true graph:
  - Truth: Gene knock-out experiments estimate true total causal effect
  - Prediction: IDA

Patterns
• A Markov Equivalence Class (MEC) contains all DAGs consistent with a set of conditional independences
• All DAGs in the MEC share the same adjacencies
• All share the same “unshielded colliders”
• Can be represented with a “pattern”:
  - common “skeleton” of adjacencies
  - edges directed when all the DAGs within the MEC agree on their orientation
  - edges left undirected otherwise

Intervention Effects When the DAG is Absent (IDA)
• Run PC.
• Calculate the effect q of manipulating arcA on sodA in each DAG represented by the resulting pattern.
• Count how many times each value of q occurs.
[Figure: the DAGs over fnr, arcA, and sodA represented by the pattern, each labeled with the effect of arcA on sodA; the multiset of effects is {0, 0, 0, q1, q2, q3}]

IDA
• Too expensive if many variables.
• By considering only the local structure around sodA and arcA, one can calculate all possible effects, but not how many times they occur.
[Figure: local structures around arcA and sodA; the set of possible effects is {0, q1, q2, q3}]

IDA application
• p = 5360 genes (expression of genes)
• 231 gene knock-downs, giving about $1.2 \cdot 10^{6}$ intervention effects
• The truth is “known in good approximation” (thanks to intervention experiments)
• Goal: prediction of the true large intervention effects based on observational data with no knock-downs
• n = 63 observational data

IDA (Maathuis et al.)

CStaR
• Runs IDA multiple times in order to choose the genes that are most stably among the ones selected as the strongest.
• Sample 50% of the original data set (with replacement) one hundred times.
• For each subsample, run the IDA algorithm.
• Take the output of the IDA algorithm for that subsample and order the variables by the size of the estimated (by IDA) lower bound of the total effect of the variable on the target.

Stability Selection Step
• Record the frequency with which a given variable appears in the top q of the total effect sizes, for a user-selected value q.
• Select the variables that appear with the highest frequency (that is, those judged most often to have large lower bounds of total effects on the target).

Stability Selection
Order the variables by selection frequency, from most often to least often selected: $\Pi_1 > \Pi_2 > \ldots > \Pi_p$.
Define the stably selected genes (covariates) as $\hat{S}_{\text{stable}} = \{ j : \Pi_j > \pi_{\text{thr}} \}$ for some threshold $0.5 < \pi_{\text{thr}} \le 1$.
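The subsample, rank, and count procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CStaR implementation: the `score` callable is a hypothetical stand-in for running IDA on a subsample and returning a lower bound on each variable's total effect, and the toy score used here (absolute sample covariance with the target) is only a placeholder. All function and parameter names are illustrative.

```python
import random
from collections import Counter

def stability_selection(rows, score, q, pi_thr, n_subsamples=100, seed=0):
    """Return (stable_set, frequencies).

    rows   : list of observations
    score  : callable(rows) -> list of p scores (larger = stronger);
             hypothetical stand-in for the IDA lower bound on the total effect
    q      : number of top-ranked variables recorded per subsample
    pi_thr : selection threshold, 0.5 < pi_thr <= 1
    """
    rng = random.Random(seed)
    half = len(rows) // 2
    counts = Counter()
    for _ in range(n_subsamples):
        # 50% of the data, drawn with replacement as described above.
        sub = [rng.choice(rows) for _ in range(half)]
        scores = score(sub)
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        counts.update(ranked[:q])  # record which variables made the top q
    p = len(score(rows))
    freq = [counts[j] / n_subsamples for j in range(p)]
    stable = {j for j in range(p) if freq[j] > pi_thr}
    return stable, freq

def abs_covariance_scores(rows):
    """Toy score (assumption): |cov(x_j, y)|, with y stored last in each row."""
    n = len(rows)
    p = len(rows[0]) - 1
    mean = [sum(r[j] for r in rows) / n for j in range(p + 1)]
    return [abs(sum((r[j] - mean[j]) * (r[p] - mean[p]) for r in rows) / n)
            for j in range(p)]

# Toy data: the target depends strongly on variable 0 only.
data_rng = random.Random(1)
data = []
for _ in range(60):
    x = [data_rng.gauss(0, 1) for _ in range(5)]
    data.append(tuple(x) + (3.0 * x[0] + data_rng.gauss(0, 0.5),))

stable, freq = stability_selection(data, abs_covariance_scores, q=1, pi_thr=0.6)
print("stably selected:", stable)
print("selection frequencies:", freq)
```

Passing the scoring step in as a callable keeps the resampling logic separate from the causal estimation, so the same loop could in principle wrap an actual IDA run.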
Denote the number of wrongly selected genes (false positives) by $V = |\hat{S}_{\text{stable}} \cap S_{\text{false}}|$, where $S_{\text{false}}$ is the set of genes (covariates) whose true lower bound $\beta_j$ is 0.

Stability Selection
For a given threshold $\pi_{\text{thr}}$ and a given value of q,
$E[V] \le \frac{1}{2\pi_{\text{thr}} - 1} \cdot \frac{q^2}{p}$.
If $\pi_{\text{thr}} = \hat{\Pi}_j$, then
$\frac{E[V]}{p} \le \frac{1}{2\hat{\Pi}_j - 1} \cdot \frac{q^2}{p^2}$.

Stability Selection
• CStaR is relatively insensitive to the choice of the range of q values.
• Down to a certain lower bound, small values of q lead to higher sensitivity.
• For q-values below the lower bound, the ranking becomes unstable again.

Stability Selection
• All genes are ranked according to the median rank with respect to the different q-values.
• Ties in the final ranking are broken according to median total causal effect size.

CStaR Results

Choosing Experiments: Stekhoven et al.
• Mouse-ear cress
• Response Y: days to bolting (flowering) of the plant
• Covariates X: gene-expression profile
• Observational data with n = 47 and p = 21,326

Experimental Confirmation: Stekhoven et al.
• PC + IDA + stability selection
• Performed experiments on 14 of the top 20 genes (not previously known, mutant easily available)

Results: Arabidopsis thaliana
• 9 among the 14 mutants survived
• 4 among the 9 mutants (genes) showed a significant effect for Y relative to the wildtype (non-mutated plant)

Chen and Storey
“For an individual organism, DNA has the useful feature that it is usually a static variable, meaning that it is fixed and will not change with changing RNA levels, protein levels, phenotypes, or environmental conditions.
By performing designed crosses of genetically distinct inbred or isogenic lines, one can randomize the genotypes of an organism from two or more genetic backgrounds, thereby producing independent realizations of DNA content from offspring to offspring.”

Chen and Storey
Identify causal relations of the form L → Ti → Tj, where L is known to be exogenous and prior to Ti and Tj
• L is the genotype at a fixed locus, generated through crossing two haploid parental strains to produce 112 recombinant haploid segregant strains
• Ti and Tj are expression levels of genes
• Given this background knowledge, we just need to determine that L and Ti are dependent, Ti and Tj are dependent, and Tj is independent of L given Ti.

Faith et al
• Used Many Microbes Microarrays Database (M3D)
• Evaluated using RegulonDB
• Restricted search space: only allowed edges out of genes coding for TFs
• Compared several search algorithms (but not a fair comparison for the Bayes Net learning algorithm)

References
LS Chen, F Emmert-Streib, and JD Storey. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 2007.
T. Chu, C. Glymour, R. Scheines, and P. Spirtes. A statistical problem for inference to regulatory structure from associations of gene expression measurement with microarrays. Bioinformatics, 19:1147–1152, 2003. PMID: 12801876.
Jeremiah J. Faith, Boris Hayete, Joshua T. Thaden, Ilaria Mogno, Jamey Wierzbowski, Guillaume Cottarel, Simon Kasif, James J. Collins, and Timothy S. Gardner. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLOS Biology, 5(1):0054–0066, 2007.
J Li and M Biggin. Statistics requantitates the central dogma. Science, 347(6226):1066–1067, 2015.
Marloes H Maathuis, Diego Colombo, Markus Kalisch, and Peter Bühlmann. Predicting causal effects in large-scale systems from observational data. Nature Methods, 7(4):247–248, 2010.
K. Sachs, O. Perez, D. Pe’er, D.A. Lauffenburger, and G.P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523–529, 2005.
Daniel J. Stekhoven, Izabel Moraes, Gardar Sveinbjornsson, Lars Hennig, Marloes H. Maathuis, and Peter Bühlmann. Causal stability ranking. Bioinformatics, 28(21):2819–2823, 2012.