Blessing of heterogeneous large-scale data for high-dimensional causal inference

Peter Bühlmann, Seminar für Statistik, ETH Zürich
joint work with Jonas Peters (MPI Tübingen) and Nicolai Meinshausen (SfS, ETH Zürich)

Heterogeneous large-scale data ("Big Data")

the talk is not (yet) about "really big data", but we will take advantage of the heterogeneity that often arises with large-scale data, where an i.i.d./homogeneity assumption is not appropriate

causal inference = intervention analysis

in genomics (for yeast or plants): if we made an intervention at a single gene (or at many genes), what would be its (their) effect on a response of interest?

we want to infer/predict such effects without actually doing the intervention, e.g.
• from observational data (cf. Pearl; Spirtes, Scheines & Glymour), i.e. from observations of a "steady-state system"
or, as is mainly our focus,
• from observational and interventional data with well-specified interventions
• from changing environments or experimental settings with "vaguely" or "un-" specified interventions
that is, from "heterogeneous" data

Genomics

1. Flowering of Arabidopsis thaliana

phenotype/response variable of interest: Y = days to bolting (flowering)
"covariates": X = gene expressions of p = 21,326 genes
question: infer/predict the effect of knocking out a single gene on the phenotype/response variable Y

using a statistical method based on n = 47 observational data points, we validated the top predictions with randomized experiments, with some moderate success (Stekhoven, Moraes, Sveinbjörnsson, Hennig, Maathuis & PB, 2012)
2. Gene expressions of yeast

p = 5,360 genes
phenotype of interest: Y = expression of the first gene; "covariates": X = gene expressions of all other genes
then: Y = expression of the second gene, X = all other genes; and so on
⇝ infer/predict the effects of single gene deletions on all other genes

Effects of single gene deletions on all other genes (yeast)
(Maathuis, Colombo, Kalisch & PB, 2010)
• p = 5,360 genes (expression of genes)
• 231 single gene deletions ⇝ about 1.2 · 10^6 intervention effects
• the truth is "known in good approximation" (thanks to the intervention experiments)
goal: prediction of the true large intervention effects based on n = 63 observational data points with no knock-downs

[Figure: true positives (0-1,000) vs. false positives (0-4,000) for IDA, Lasso, Elastic-net and Random guessing; IDA clearly dominates the regression-based methods]

why does regression fail? because in

  Y = Σ_{j=1}^p β_j X^(j) + ε,

β_j measures the effect of X^(j) on Y when keeping all other variables {X^(k); k ≠ j} fixed; but when intervening at a gene, some/many other genes might change as well and cannot be kept fixed
the causal inference framework allows us to define a dynamic notion of intervention effect "without keeping other variables fixed"

"predictions of single gene deletion effects in yeast" was a good finding for a difficult problem...
...but some criticisms apply:
• not very "robust": depends somewhat on how we define the "truth" (as found recently by J. Mooij, J. Peters and others)
• old, publicly available data (Hughes et al., 2000)
• no collaborator... just (publicly available) data (this is good! we didn't use the interventional data for training)

a new and "better" attempt for the "same" problem: single gene knock-downs in yeast with genome-wide expression measurements (observational and interventional data)
goal: predict unseen gene knock-down effects based on both observational and interventional data
collaborators: Frank Holstege, Patrick Kemmeren et al. (Utrecht); data from modern technology (Kemmeren, ..., and Holstege, Cell, 2014)

Graphical and structural equation models

late 1980s: Pearl; Spirtes, Glymour, Scheines; Dawid; Lauritzen; ...
the powerful language of graphs makes formulations and models more transparent (Evans and Richardson, 2011)

variables X_1, ..., X_{p+1} (X_{p+1} = Y is the response of interest)
directed acyclic graph (DAG) D^0 encoding the true underlying causal influence diagram
structural equation model (SEM):

  X_j ← f_j^0(X_{pa_{D^0}(j)}, ε_j),  j = 1, ..., p+1,  ε_1, ..., ε_{p+1} independent

e.g. linear:

  X_j ← Σ_{k ∈ pa_{D^0}(j)} β_{jk}^0 X_k + ε_j,  j = 1, ..., p+1

causal variables for Y = X_{p+1}: S^0 = {j; j ∈ pa_{D^0}(Y)}
causal coefficients for Y in the linear SEM: β_{Yj}^0, j ∈ pa_{D^0}(Y)

severe issues of identifiability! e.g. for (X, Y) ~ N_2(0, Σ), the two DAGs X → Y ("X causes Y") and Y → X ("Y causes X") cannot be distinguished from the observational distribution

agenda for estimation (Chickering, 2002; Shimizu, 2005; Kalisch & PB, 2007; ...):
1. estimate the Markov equivalence class of DAGs D^0: D̂
2. derive the causal variables: the ones which are causal in all DAGs from D̂; derive bounds for causal effects based on D̂ (Maathuis, Kalisch & PB, 2009)
(a hedged sketch of these two steps appears below)
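As an illustration of this agenda, here is a minimal sketch using the pcalg R package (Kalisch et al., 2012, cited at the end): the PC algorithm estimates the Markov equivalence class (CPDAG), and ida() returns the multiset of possible causal effects over all DAGs in the class. The simulated data and the tuning choices (alpha, effect positions) are our own, not from the talk.

```r
library(pcalg)

## toy data from a linear SEM with DAG X1 -> X2 -> X3 and X1 -> X3
set.seed(1)
n <- 500
X1 <- rnorm(n)
X2 <- 0.8 * X1 + rnorm(n)
X3 <- 0.5 * X1 + 0.7 * X2 + rnorm(n)
X  <- cbind(X1, X2, X3)

## 1. estimate the Markov equivalence class (CPDAG) with the PC algorithm
suffStat <- list(C = cor(X), n = n)
cpdag <- pc(suffStat, indepTest = gaussCItest, alpha = 0.01,
            labels = colnames(X))

## 2. bounds on the causal effect of X1 on X3: one value per DAG in the
##    equivalence class (IDA; Maathuis, Kalisch & PB, 2009)
ida(x.pos = 1, y.pos = 3, mcov = cov(X), graphEst = cpdag@graph)
```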
goal: construction of confidence statements (without knowing the structure of the underlying graph)
problem: direct likelihood-based inference (as in, e.g., Mladen Kolar's talk for undirected graphs) is difficult because of the severe identifiability issues!

nevertheless, our goal is to construct a procedure Ŝ ⊆ {1, ..., p} such that

  P[Ŝ ⊆ S^0] ≥ 1 − α   (no false positives)

without the need to specify identifiable components; that is, if a variable X_j cannot be identified to be causal ⇒ j ∉ Ŝ
we do this with an entirely new framework (which might be "much simpler" for causal inference), NOT doing or AVOIDING graphical model fitting, potential outcome models, ... and maybe more accessible to experts in regression modeling?

Causal inference using invariant prediction
Peters, PB and Meinshausen (2015)

a main message: the causal structure/components remain the same for different sub-populations, while the non-causal components can change across sub-populations
thus ⇝ look for "stability" of structures among different sub-populations (a toy simulation below illustrates this)

goal: find the causal variables (components) among a p-dimensional predictor variable X for a specific response variable Y
consider data

  (X^e, Y^e) ~ F^e,  e ∈ E,

with response variables Y^e and predictor variables X^e; here e ∈ E denotes an experimental setting (E = the space of experimental settings)
"heterogeneous" data from different environments/experiments e ∈ E (an aspect of "Big Data")

example 1: E = {1, 2} encoding observational (1) and all potentially unspecific interventional data (2)
example 2: E = {1, 2} encoding observational data (1) and (repeated) data from one specific intervention (2)
example 3: E = {1, 2, 3} ... or E = {1, 2, 3, ..., 26} ...
we do not need data from carefully designed (randomized) experiments
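Before formalizing the invariance idea, here is the promised toy simulation (our own sketch, not from the talk): X1 is a parent of Y and X2 a child of Y; in the second environment we intervene on X1 and on the noise of X2, but never on the structural equation for Y. The regression that includes the child X2 changes across environments, while the regression on the parent alone stays (approximately) invariant.

```r
set.seed(1)
n <- 2000
sim_env <- function(shift) {
  X1 <- rnorm(n, mean = shift)              # intervened in environment 2
  Y  <- 0.8 * X1 + rnorm(n)                 # structural equation for Y: unchanged
  X2 <- 1.2 * Y + rnorm(n, sd = 1 + shift)  # child of Y; its noise is intervened
  data.frame(X1, X2, Y)
}
d1 <- sim_env(0)   # environment 1: observational
d2 <- sim_env(2)   # environment 2: interventions not acting on Y directly

## full regression uses the non-causal child X2: coefficients drift across e
rbind(coef(lm(Y ~ X1 + X2, data = d1)), coef(lm(Y ~ X1 + X2, data = d2)))
## regression on the causal parent S^0 = {X1} is stable across e
rbind(coef(lm(Y ~ X1, data = d1)), coef(lm(Y ~ X1, data = d2)))
```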
Invariance Assumption

for a set S* ⊆ {1, ..., p}:

  L(Y^e | X^e_{S*}) is invariant across e ∈ E

in the linear model setting: there exists a vector γ* such that

  ∀ e ∈ E: Y^e = X^e γ* + ε^e,  ε^e ⊥ X^e_{S*},  S* = {j; γ*_j ≠ 0},

with ε^e ~ F_ε the same for all e, while X^e has an arbitrary distribution, different across e

γ* and S* are interesting in their own right: they are the parameter and the structure which remain invariant across experimental settings, or across heterogeneous groups

Moreover: link to causality

assume, in short: L(Y^e | X^e_{S*}) is invariant across e ∈ E

Proposition (Peters, PB & Meinshausen, 2015)
If E does not affect the structural equation for Y in an SEM, e.g. in a linear SEM

  Y^e ← Σ_{k ∈ pa(Y)} β_{Yk} X^e_k + ε^e_Y  (the same equation for all e, with ε^e_Y ~ F_ε for all e),

then S^0 = pa(Y) (the causal variables) satisfies the Invariance Assumption (w.r.t. E)

⇝ the causal variables lead to invariance (of the conditional distribution)

if E does not affect the structural equation for Y, then S^0 = pa(Y) satisfies the Invariance Assumption; this holds, for example, for:
• do-interventions (Pearl) at variables other than Y
• noise (or "soft") interventions (Eberhardt & Scheines, 2007) at variables other than Y
in addition, there might be many other sets S* satisfying the Invariance Assumption, but uniqueness is not really important (see later)

how do we know whether E is not affecting the structural equation for Y?
if E does affect the structural equation for Y, we will argue "robustness" of our procedure (proposed later): ⇝ no causal statements, no false positives; conservative, but on the safe side

Invariance Assumption: plausible to hold with real data

[Figure: two-dimensional conditional distributions of observational (blue) and interventional (orange) data, with no intervention at the displayed variables X, Y; left panel: seemingly no invariance of the conditional distribution; right panel: plausible invariance of the conditional distribution]
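A quick way to produce such diagnostic plots, continuing the toy simulation above (again our own sketch): overlay the (X_j, Y) point clouds of the two environments for a candidate variable; under invariance of L(Y | X_S), the clouds should look like two samples from the same conditional law.

```r
## diagnostic in the spirit of the blue/orange figure, using d1, d2 from above
par(mfrow = c(1, 2))
plot(d1$X2, d1$Y, col = "blue", xlab = "X2 (child of Y)", ylab = "Y",
     main = "no invariance")          # clouds separate: {X2} is not a valid S
points(d2$X2, d2$Y, col = "orange")
plot(d1$X1, d1$Y, col = "blue", xlab = "X1 (parent of Y)", ylab = "Y",
     main = "plausible invariance")   # same conditional law of Y given X1
points(d2$X1, d2$Y, col = "orange")
```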
A procedure: population case

require and exploit the Invariance Assumption: L(Y^e | X^e_{S*}) the same across e ∈ E

  H_{0,γ,S}(E): γ_k = 0 if k ∉ S, and ∃ F_ε such that ∀ e ∈ E: Y^e = X^e γ + ε^e, ε^e ⊥ X^e_S, with ε^e ~ F_ε the same for all e

if H_{0,γ,S}(E) is true:
• the model is correct
• S, γ are plausible causal variables/predictors and coefficients
and

  H_{0,S}(E): there exists γ such that H_{0,γ,S}(E) holds;

S is called a set of "plausible causal predictors" if H_{0,S}(E) holds

identifiable causal predictors under E: defined as the set S(E), where

  S(E) = ⋂ {S; H_{0,S}(E) holds},

the intersection of all sets of plausible causal predictors; under the Invariance Assumption we have, for any S*, S(E) ⊆ S*, and this is key to obtaining confidence bounds for the identifiable causal predictors

we have by definition: S(E_1) ⊆ S(E_2) if E_1 ⊆ E_2; with
• more interventions
• more "heterogeneity"
• more "diversity in complex data"
we can identify more causal predictors: S(E) ↗ as E ↗

question: when is S(E) = S^0? (but being "equal to" S^0 or unique is not important; see later)

Theorem (Peters, PB and Meinshausen, 2015)
S(E) = S^0 = (the parental set of Y in the causal DAG) if there is:
• a single do-intervention for each variable other than Y, and |E| = p
• a single noise intervention for each variable other than Y, and |E| = p
• a simultaneous noise intervention, and |E| = 2
the conditions can be relaxed such that it is not necessary to intervene at all the variables

Statistical confidence sets for causal predictors

"the finite-sample version of S(E) = ⋂ {S; H_{0,S}(E) is true}": for "any" S ⊆ {1, ..., p}, test whether H_{0,S}(E) is accepted or rejected, and set

  Ŝ(E) = ⋂ {S; H_{0,S}(E) accepted at level α}

for H_{0,S}(E): test constancy of the regression parameter and of its residual error distribution across e ∈ E, by weakening H_{0,S}(E) to

  H̃_{0,S}(E): ∃ β, σ such that β^e_pred(S) ≡ β, σ^e_pred(S) ≡ σ ∀ e ∈ E,

where

  β^e_pred(S) = argmin_{β; β_k = 0 (k ∉ S)} E|Y^e − X^e β|^2,
  σ^e_pred(S) = (E|Y^e − X^e β^e_pred(S)|^2)^{1/2}

note: H_{0,S}(E) true ⇒ H̃_{0,S}(E) true

testing H̃_{0,S}(E), assuming Gaussian errors:

  D^T Σ_D^{−1} D / (σ̂^2 n_e) ~ F(n_e, n_{−e} − |S| − 1),
  D = Y^e − Ŷ^e, with Ŷ^e based on the data excluding e,
  Σ_D = I_{n_e} + X_{e,S} (X^T_{−e,S} X_{−e,S})^{−1} X^T_{e,S};

reject H̃_{0,S}(E) if the p-value < α/|E|

Ŝ(E) = ⋂ {S; H_{0,S}(E) accepted at level α}, for some significance level 0 < α < 1
(a small sketch of this test-and-intersect idea follows below)
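Here is a minimal, self-contained sketch of the test-and-intersect idea. It uses a simpler residual-based check in place of the exact F-statistic from the slide (we fit one pooled regression on X_S and compare residual means and variances between two environments); the function names are our own.

```r
## simple surrogate test for H~_{0,S}: pooled fit on X_S, then compare
## residual location and scale across the two environments (Bonferroni)
invariance_pval <- function(Y, X, S, env) {
  r <- if (length(S) == 0) Y - mean(Y)
       else resid(lm(Y ~ X[, S, drop = FALSE]))
  p_mean <- t.test(r[env == 1], r[env == 2])$p.value
  p_var  <- var.test(r[env == 1], r[env == 2])$p.value
  min(1, 2 * min(p_mean, p_var))
}

## Shat(E): intersect all sets S whose null is not rejected at level alpha
Shat <- function(Y, X, env, alpha = 0.05) {
  p <- ncol(X)
  subsets <- c(list(integer(0)),
               unlist(lapply(1:p, function(k) combn(p, k, simplify = FALSE)),
                      recursive = FALSE))
  accepted <- Filter(function(S) invariance_pval(Y, X, S, env) > alpha, subsets)
  if (length(accepted) == 0) return(integer(0))  # every model rejected: stay empty
  Reduce(intersect, accepted)
}

## on the toy data from before: should recover S^0 = {1}, i.e. X1
Y   <- c(d1$Y, d2$Y)
X   <- as.matrix(rbind(d1[, c("X1", "X2")], d2[, c("X1", "X2")]))
env <- rep(1:2, each = nrow(d1))
Shat(Y, X, env)
```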
going through all sets S?

1. start with S = ∅: if H_{0,∅}(E) is accepted ⇒ Ŝ(E) = ∅
2. consider small sets S of cardinality 1, 2, ... and maintain the intersection S_∩ of the accepted sets considered so far: for each S with H_{0,S}(E) accepted, update S_∩ ← S_∩ ∩ S; if the intersection S_∩ = ∅ ⇒ Ŝ(E) = ∅; if not, discard all S with S ⊇ S_∩ and continue with the remaining sets
3. for large p: restrict the search space to variables selected by a Lasso regression; this needs a faithfulness assumption (and sparsity and assumptions on X^e for its justification)

confidence sets with invariant prediction

1. for each S ⊆ {1, ..., p}, construct a set Γ̂_S(E) as follows: if H̃_{0,S}(E) is rejected at level α/(2|E|), set Γ̂_S(E) = ∅; otherwise Γ̂_S(E) = classical (1 − α/2) confidence interval for β_S based on X_S
2. set

  Γ̂(E) = ∪_{S ⊆ {1,...,p}} Γ̂_S(E)  (estimated plausible causal coefficients)

we then obtain: Ŝ(E) as a confidence set for S*, and Γ̂(E) as a confidence set for γ*

Theorem (Peters, PB and Meinshausen, 2015)
assume a linear model with Gaussian errors; then, for any γ*, S* satisfying the Invariance Assumption:

  P[Ŝ(E) ⊆ S*] ≥ 1 − α  (confidence w.r.t. the true causal variables)
  P[γ* ∈ Γ̂(E)] ≥ 1 − α  (confidence set for the true causal parameter)

and we can choose S* = S^0 = the causal variables in the linear SEM if E does not affect the structural equation of Y

"on the safe side" (conservative): we do not need to care about identifiability, since if an effect is not identifiable, the method will not wrongly claim an effect; and we do not require that S* is minimal or unique

"the first" result on statistical confidence for potentially non-identifiable causal predictors when the structure is unknown (the route via graphical modeling to confidence sets seems awkward), leading to (hopefully) more reliable causal inferential statements

"Robustness"

if the Invariance Assumption does not hold because E has a direct effect on Y ⇝ for all S, L(Y^e | X^e_S) is not invariant across e ∈ E ⇝ Ŝ(E) = ∅ (at least as n → ∞): still on the safe side, but no power
(the sketch below shows the corresponding R-package call)
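In practice one would not hand-roll the test: the R package InvariantCausalPrediction (Meinshausen, 2014; cited at the end) implements the procedure, including the confidence intervals Γ̂(E). A minimal sketch on the toy data from above; the exact arguments and printed output may differ across package versions.

```r
library(InvariantCausalPrediction)

## Shat(E) and the confidence set Gamma-hat(E) in one call;
## ExpInd indicates which environment each sample comes from
icp <- ICP(X = X, Y = Y, ExpInd = env, alpha = 0.05)
print(icp)   # accepted variables with their confidence intervals
```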
Empirical results: simulations

100 different scenarios, 1,000 data sets per scenario: |E| = 2, n_obs = n_interv ∈ {100, ..., 500}, p ∈ {5, ..., 40}

[Figure: success probability (power to detect causal predictors) against familywise error rate P[Ŝ(E) ⊄ S*], aimed at 0.05, over the 100 scenarios, for Invariant prediction, Lingam, GES, GIES (known and unknown intervention targets), Regression and Marginal]

Single gene deletion experiments in yeast

p = 6,170 genes
response of interest: Y = expression of the first gene; "covariates": X = gene expressions of all other genes; then Y = expression of the second gene, X = all other genes; and so on
⇝ infer/predict the effects of a single gene knock-down on all other genes
collaborators: Frank Holstege, Patrick Kemmeren et al. (Utrecht); data from modern technology (Kemmeren, ..., and Holstege, Cell, 2014)

Kemmeren et al. (2014): genome-wide mRNA expression in yeast, p = 6,170 genes
• n_obs = 160 "observational" samples of wild-types
• n_int = 1,479 "interventional" samples, each corresponding to a single gene deletion strain

for our method:
• we use |E| = 2 (observational and interventional data)
• training-test split: training on all observational and 2/3 of the interventional data; testing on the other 1/3 of the gene deletion interventions; repeated for the three blocks of interventional data
• since every interventional data point is used once as a response variable, we use coverage 1 − α/n_int with α = 0.01 and n_int = 1,479
(a sketch of this setup appears below)

Results

8 genes are significant (at level α = 0.01) causal variables (each of the 8 genes "causes" another gene)

validation with the test data:

  method          no. true positives (out of 8)
  invar. pred.    6
  GIES            2
  PC-IDA          2
  marg. corr.     2
  rand. guess.    *

*: quantiles for the number of true positives among 7 random draws: 2 (95%), 3 (99%)

⇝ our invariant prediction method has the most power, and it should exhibit control against false positive selections
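A hedged sketch of how that setup maps onto the ICP() call; the object names expr and is_interv are our own placeholders, and with p = 6,170 covariates some pre-screening is needed before subsets can be tested.

```r
## hypothetical inputs: expr = (160 + 1479) x 6170 expression matrix,
## is_interv = TRUE for the 1,479 single-gene-deletion samples
ExpInd <- ifelse(is_interv, 2, 1)   # |E| = 2: observational vs interventional
alpha  <- 0.01 / 1479               # coverage 1 - alpha/n_int, as on the slide

## every gene once as the response, all other genes as covariates
fits <- lapply(seq_len(ncol(expr)), function(j) {
  ICP(X = expr[, -j], Y = expr[, j], ExpInd = ExpInd, alpha = alpha)
  ## in practice the covariates are first screened, e.g. with the Lasso
  ## (step 3 of the search strategy above), before ICP tests subsets
})
```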
[Figure: number of strong intervention effects (0-8) vs. number of intervention predictions (0-25) for PERFECT, INVARIANT (= invariant prediction), HIDDEN-INVARIANT (= invariant prediction with some hidden variables), PC, RFCI, REGRESSION (CV-Lasso), GES and GIES, and RANDOM with a 99% prediction interval]

Validation (Meinshausen, Hauser, Mooij, Peters, Versteeg & PB, 2015)

with intervention experiments: strong intervention effect (SIE); with the yeastgenome.org database: scores A-F

  rank  cause       effect
  1     YMR104C     YMR103C
  2     YPL273W     YMR321C
  3     YCL040W     YCL042W
  4     YLL019C     YLL020C
  5     YMR186W     YPL240C
  6     YDR074W     YBR126C
  7     YMR173W     YMR173W-A
  8     YGR162W     YGR264C
  9     YOR027W     YJL077C
  10    YJL115W     YLR170C
  11    YOR153W     YDR011W
  12    YLR270W     YLR345W
  13    YOR153W     YBL005W
  14    YJL141C     YNR007C
  15    YAL059W     YPL211W
  16    YLR263W     YKL098W
  17    YGR271C-A   YDR339C
  18    YLL019C     YGR130C
  19    YCL040W     YML100W
  20    YMR310C     YOR224C

(the original table additionally marks, per row, SIE and the database scores A-F)
SIE: correctly predicting a strong intervention effect which is in the 1%- or 99%-tail of the observational data

Flow cytometry data (Sachs et al., 2005)

• p = 11 abundances of chemical reagents
• 8 different environments (not "well-defined" interventions); one of them observational, 7 with different reagents added
• each environment contains n_e ≈ 700-10,000 samples

goal: recover the network of causal relations (linear SEM) among Raf, Mek, Erk, Akt, PKA, PKC, PLCg, PIP2, PIP3, JNK and p38

approach: invariant causal prediction (one variable is the response Y, the other 10 are the covariates X; do this 11 times, with every variable once the response)

main concern: different environments might have a direct influence on the response ⇝ the Invariance Assumption would fail

instead of requiring invariance among all 8 environments ⇝ require invariance for pairs of environments E_ij (28 pairs in total) and apply a Bonferroni correction by taking the union

  S̃(E) = ∪_{i<j} Ŝ_{α/28}(E_ij)

this weakens the assumption that an intervention should not directly influence the response Y; and if a pair of interventions nevertheless does ⇝ Ŝ = ∅ ("robustness", as discussed before); a sketch of this pairwise construction follows below

[Figure: estimated network over the 11 reagents; blue edges: only the invariant causal prediction approach (ICP); red: only ICP allowing hidden variables and feedback; purple: both ICP with and without hidden variables; solid: relations reported in the literature; broken: new findings not reported in the literature]

⇝ reasonable consensus with existing results, but no real ground-truth available; serves as an illustration that we can work with "vaguely defined interventions"
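A sketch of the pairwise-union construction, again leaning on the InvariantCausalPrediction package. Here X, Y, ExpInd are placeholders for one response/covariate split of the Sachs data, and reading the selected set off the fitted object via a pvalues component is an assumption on our side (accessors vary across package versions).

```r
## S-tilde(E) = union over the 28 environment pairs of Shat_{alpha/28}(E_ij)
pairs_ij <- combn(8, 2, simplify = FALSE)
alpha    <- 0.05

S_tilde <- sort(unique(unlist(lapply(pairs_ij, function(ij) {
  keep <- ExpInd %in% ij                        # samples from this pair only
  fit  <- ICP(X[keep, ], Y[keep], ExpInd[keep], alpha = alpha / 28)
  which(fit$pvalues <= alpha / 28)              # assumed accessor for Shat(E_ij)
}))))
```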
Concluding thoughts

generalize the Invariance Assumption and the statistical testing to nonparametric/nonlinear models, in particular additive models:

  ∀ e ∈ E: Y^e = f*(X^e_{S*}) + ε^e,  ε^e ~ F_ε,  ε^e ⊥ X^e_{S*}
  ∀ e ∈ E: Y^e = Σ_{j ∈ S*} f*_j(X^e_j) + ε^e,  ε^e ~ F_ε,  ε^e ⊥ X^e_{S*}

the statistical significance testing becomes more difficult; improved identifiability with nonlinear SEMs (Mooij et al., 2009)

generalize to include hidden variables: S^0 = pa(Y) ∩ observed variables, S^0_H = pa(Y) ∩ hidden variables
the presented procedure is still OK if (essentially):
• no interventions at Y, at S^0_H, or at ancestors(S^0_H)
• no hidden confounder between Y and S^0

[Figure: DAG with hidden variables H1, H2, H3, observed variables X1, ..., X12 and response Y; S^0 marked among the observed parents of Y]

more general hidden variable models can be treated with a "related" technique (Rothenhäusler, Heinze, Peters & Meinshausen, 2015): with intervention variables I, hidden confounders W, and X = (C, E) forming a DAG with causes C and effects E of Y, without knowing which variables belong to C and which to E:

  C ← B_{c←i} I + B_{c←w} W + ε_C
  Y ← B_{y←w} W + B_{y←c} C + ε_Y
  E ← B_{e←i} I + B_{e←w} W + B_{e←c} C + B_{e←y} Y + ε_E

provocative next step: how about using "Big Data"? ⇝ structure the "large-scale" data into different unknown groups of experimental settings E; that is, learn E from the data, where the "optimal" E is a trade-off between identifiability and statistical power

learning/estimating E

problem: given a "large bag of data", can we estimate the unknown underlying different experimental conditions?
mathematically: denote by J all experimental settings satisfying the Invariance Assumption ⇝ a mixture distribution for a constructed e ∈ E:

  F^e = Σ_{j ∈ J} w^e_j F^j;

when pooling two constructed experimental settings e_1 and e_2 ⇝ a new mixture distribution with weights (w^{e_1} + w^{e_2})/2
mixture modeling, or change-point modeling for (time-)ordered data, might be useful to estimate E

the causal components remain the same for different sub-populations or experimental settings ⇝ exploit the power of heterogeneity in complex data! and confidence bounds follow naturally

Thank you!

Software
R package pcalg (Kalisch, Mächler, Colombo, Maathuis & PB, 2010-2015)
R package InvariantCausalPrediction (Meinshausen, 2014)

References to some of our own work:
• Peters, J., Bühlmann, P. and Meinshausen, N. (2015). Causal inference using invariant prediction: identification and confidence intervals. To appear in Journal of the Royal Statistical Society, Series B (with discussion). Preprint arXiv:1501.01332.
• Meinshausen, N., Hauser, A., Mooij, J., Peters, J., Versteeg, P. and Bühlmann, P. (2015). Causal inference from gene perturbation experiments: methods, software and validation. Preprint.
• Hauser, A. and Bühlmann, P. (2015). Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs. Journal of the Royal Statistical Society, Series B, 77, 291-318.
• Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13, 2409-2464.
• Kalisch, M., Mächler, M., Colombo, D., Maathuis, M.H. and Bühlmann, P. (2012). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11), 1-26.
• Stekhoven, D.J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M.H. and Bühlmann, P. (2012). Causal stability ranking. Bioinformatics, 28, 2819-2823.
• Maathuis, M.H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature Methods, 7, 247-248.
• Maathuis, M.H., Kalisch, M. and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Annals of Statistics, 37, 3133-3164.