Blessing of heterogeneous large-scale data
for high-dimensional causal inference
Peter Bühlmann
Seminar für Statistik, ETH Zürich
joint work with
Jonas Peters
MPI Tübingen
Nicolai Meinshausen
SfS ETH Zürich
Heterogeneous large-scale data
Big Data
the talk is not (yet) on “really big data”
but we will take advantage of heterogeneity
often arising with large-scale data where
i.i.d./homogeneity assumption is not appropriate
causal inference = intervention analysis
in genomics (for yeast or plants):
if we made an intervention at a single gene (or at many genes),
what would be its (their) effect on a response of interest?
we want to infer/predict such effects without actually doing the intervention,
e.g. from observational data (cf. Pearl; Spirtes, Glymour & Scheines)
(from observations of a “steady-state system”)
or, as is mainly our focus, from:
• observational and interventional data with well-specified interventions
• changing environments or experimental settings with “vaguely” or “un-”specified interventions
that is, from “heterogeneous” data
Genomics
1. Flowering of Arabidopsis thaliana
phenotype/response variable of interest:
Y = days to bolting (flowering)
“covariates” X = gene expressions from p = 21,326 genes
question: infer/predict the effect of knocking out a single gene on the phenotype/response variable Y
using a statistical method based on n = 47 observational data points
⇝ validated the top predictions with randomized experiments, with some moderate success
(Stekhoven, Moraes, Sveinbjörnsson, Hennig, Maathuis & PB, 2012)
2. Gene expressions of yeast
p = 5360 genes
phenotype of interest: Y = expression of first gene
“covariates” X = gene expressions from all other genes
and then
phenotype of interest: Y = expression of second gene
“covariates” X = gene expressions from all other genes
and so on
infer/predict the effects of single gene deletions on all other
genes
Effects of single gene deletions on all other genes (yeast)
(Maathuis, Colombo, Kalisch & PB, 2010)
• p = 5360 genes (expression of genes)
• 231 single gene deletions ⇝ 1.2 · 10^6 intervention effects
• the truth is “known in good approximation”
(thanks to intervention experiments)
goal: prediction of the true large intervention effects
based on observational data with no knock-downs
[Figure: true positives (0–1,000) vs. false positives (0–4,000) for IDA, Lasso, Elastic-net and Random, based on n = 63 observational data points; Lasso and Elastic-net are labeled REGRESSION]
because: in the regression model
Y = Σ_{j=1}^p β_j X^(j) + ε,
β_j measures the effect of X^(j) on Y when keeping all other variables {X^(k); k ≠ j} fixed
but when doing an intervention at a gene ⇝ some/many other genes might change as well and cannot be kept fixed
the causal inference framework allows one to define a dynamic notion of intervention effect,
“without keeping other variables fixed”
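To make this concrete, here is a minimal toy simulation (our own illustration, not from the talk): in the linear SEM X1 → X2 → Y, the multiple-regression coefficient of X1 is about 0 (X1 carries no information about Y once X2 is held fixed), while the total intervention effect of X1 on Y equals 1.

```python
# Toy linear SEM: X1 -> X2 -> Y (a sketch; variable names are ours)
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X1 = rng.normal(size=n)
X2 = X1 + rng.normal(size=n)        # X2 <- X1 + noise
Y = X2 + rng.normal(size=n)         # Y  <- X2 + noise

# multiple regression of Y on (X1, X2): coefficient of X1 is ~0,
# because the regression "keeps X2 fixed"
X = np.column_stack([X1, X2])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("regression coefficients:", beta)          # approx [0, 1]

# intervention do(X1 = x): downstream variables are re-generated,
# so the total causal effect of X1 on Y is 1 (path X1 -> X2 -> Y)
def mean_Y_do_X1(x, n=100_000):
    X2 = x + rng.normal(size=n)
    return (X2 + rng.normal(size=n)).mean()

print("intervention effect:", mean_Y_do_X1(1.0) - mean_Y_do_X1(0.0))
```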
[Figure, repeated from above: true positives vs. false positives for IDA, Lasso, Elastic-net and Random: “predictions of single gene deletion effects in yeast”]
this was a good finding for a difficult problem...
... but some criticisms apply:
• not very “robust”: depends somewhat on how we define the “truth”
(as found recently by J. Mooij, J. Peters and others)
• old, publicly available data (Hughes et al., 2000)
• no collaborator... just (publicly available) data
(this is good! we didn’t use the interventional data for training)
A new and “better” attempt for the “same” problem
single gene knock-downs in yeast, measuring genome-wide expression (observational and interventional data)
goal: predict unseen gene knock-down effects
based on both observational and interventional data
collaborators:
Frank Holstege, Patrick Kemmeren et al. (Utrecht)
data from modern technology
Kemmeren, ..., and Holstege (Cell, 2014)
Graphical and structural equation models
late 1980s: Pearl; Spirtes, Glymour, Scheines; Dawid; Lauritzen; ...
the powerful language of graphs makes formulations and models more transparent (Evans and Richardson, 2011)
variables X_1, ..., X_{p+1} (X_{p+1} = Y is the response of interest)
directed acyclic graph (DAG) D^0 encoding the true underlying causal influence diagram
structural equation model (SEM):
X_j ← f_j^0(X_{pa_{D^0}(j)}, ε_j), j = 1, ..., p+1, with ε_1, ..., ε_{p+1} independent
e.g. linear: X_j ← Σ_{k ∈ pa_{D^0}(j)} β^0_{jk} X_k + ε_j, j = 1, ..., p+1
causal variables for Y = X_{p+1}: S^0 = {j; j ∈ pa_{D^0}(Y)}
causal coefficients for Y in the linear SEM: β^0_{Yj}, j ∈ pa_{D^0}(Y)
severe issues of identifiability!
for (X, Y) ∼ N_2(0, Σ), the two DAGs X → Y (“X causes Y”) and Y → X (“Y causes X”) can generate exactly the same bivariate Gaussian distribution:
the direction of the edge is not identifiable from observational data
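A quick numeric check of this non-identifiability (our own illustration, with an arbitrarily chosen covariance matrix): both causal directions admit a linear Gaussian SEM whose implied joint distribution is exactly N_2(0, Σ).

```python
# Both DAGs X -> Y and Y -> X reproduce the same N2(0, Sigma)
import numpy as np

sx2, sy2, sxy = 1.0, 2.0, 0.8       # an arbitrary covariance matrix
Sigma = np.array([[sx2, sxy], [sxy, sy2]])

# SEM for X -> Y:  X ~ N(0, sx2),  Y <- b*X + e,  e ~ N(0, v)
b, v = sxy / sx2, sy2 - sxy**2 / sx2
cov_xy = np.array([[sx2, b * sx2], [b * sx2, b**2 * sx2 + v]])

# SEM for Y -> X:  Y ~ N(0, sy2),  X <- c*Y + d,  d ~ N(0, w)
c, w = sxy / sy2, sx2 - sxy**2 / sy2
cov_yx = np.array([[c**2 * sy2 + w, c * sy2], [c * sy2, sy2]])

print(np.allclose(cov_xy, Sigma), np.allclose(cov_yx, Sigma))  # True True
```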
agenda for estimation (Chickering, 2002; Shimizu, 2005; Kalisch & PB, 2007; ...)
1. estimate the Markov equivalence class D̂ of the true DAG D^0
2. derive the causal variables: the ones which are causal in all DAGs from D̂; derive bounds for causal effects based on D̂ (Maathuis, Kalisch & PB, 2009)
goal: construction of confidence statements (without knowing the structure of the underlying graph)
problem: direct likelihood-based inference (as in e.g. Mladen Kolar’s talk for undirected graphs) is difficult, because of severe identifiability issues!
nevertheless: our goal is to construct a procedure Ŝ ⊆ {1, ..., p} such that
P[Ŝ ⊆ S^0] ≥ 1 − α   (no false positives)
without the need to specify identifiable components;
that is: if a variable X_j cannot be identified to be causal ⇒ j ∉ Ŝ
we do this with an entirely new framework (which might be “much simpler” for causal inference),
NOT doing, or AVOIDING, graphical model fitting, potential outcome models, ...
and maybe more accessible to experts in regression modeling?
Causal inference using invariant prediction
Peters, PB and Meinshausen (2015)
a main message:
causal structure/components remain the same
for different sub-populations
while the non-causal components can change across
sub-populations
thus:
⇝ look for “stability” of structures among different sub-populations
goal:
find the causal variables (components) among a p-dimensional
predictor variable X for a specific response variable Y
consider data
(X e , Y e ) ∼ F e ,
e∈E
with response variables Y e and predictor variables X e
here: e ∈ E denotes an experimental setting
(E = the space of experimental settings)
“heterogeneous” data from
different environments/experiments e ∈ E
(aspect of “Big Data”)
data
(X e , Y e ) ∼ F e , e ∈ E
example 1: E = {1, 2} encoding observational (1) and all
potentially unspecific interventional data (2)
example 2: E = {1, 2} encoding observational data (1) and
(repeated) data from one specific intervention (2)
example 3: E = {1, 2, 3} ..., or E = {1, 2, 3, ..., 26} ...
we do not need data from carefully designed (randomized) experiments
Invariance Assumption
for a set S* ⊆ {1, ..., p}: L(Y^e | X^e_{S*}) is invariant across e ∈ E
in the linear model setting: there exists a vector γ* such that
∀ e ∈ E: Y^e = X^e γ* + ε^e,  ε^e ⊥ X^e_{S*},  S* = {j; γ*_j ≠ 0},
with ε^e ∼ F_ε the same for all e;
X^e has an arbitrary distribution, possibly different across e
γ* and S* are interesting in their own right!
namely the parameter and structure which remain invariant across experimental settings, or across heterogeneous groups
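As a toy illustration of the assumption (our own sketch, not from the paper): when the environment acts only on the covariates, the regression of Y on its causal parent is the same in every environment, while the regression on a non-causal variable (here a descendant of Y) changes.

```python
# Two environments: X1's scale changes, Y's structural equation does not
import numpy as np

rng = np.random.default_rng(1)

def sample(e, n=50_000):
    scale = 1.0 if e == 1 else 3.0           # environment shifts X1 only
    X1 = scale * rng.normal(size=n)
    Y = 2.0 * X1 + rng.normal(size=n)        # invariant: Y <- 2*X1 + eps
    X2 = Y + rng.normal(size=n)              # descendant of Y: not invariant
    return X1, X2, Y

for e in (1, 2):
    X1, X2, Y = sample(e)
    print(f"e={e}:",
          "Y~X1 slope", round(np.polyfit(X1, Y, 1)[0], 2),   # ~2 in both e
          "| Y~X2 slope", round(np.polyfit(X2, Y, 1)[0], 2)) # changes with e
```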
Moreover: the link to causality
recall, in short: L(Y^e | X^e_{S*}) is invariant across e ∈ E
Proposition (Peters, PB & Meinshausen, 2015)
If E does not affect the structural equation for Y in an SEM,
e.g. in a linear SEM: Y^e ← Σ_{k ∈ pa(Y)} β_{Yk} X^e_k + ε^e_Y,
with pa(Y) and β_{Yk} the same for all e, and ε^e_Y ∼ F_ε for all e,
then S^0 = pa(Y) (the causal variables) satisfies the Invariance Assumption (w.r.t. E)
the causal variables lead to invariance (of the conditional distribution)
if E does not affect the structural equation for Y:
S^0 = pa(Y) satisfies the Invariance Assumption
this holds, for example, for:
• do-interventions (Pearl) at variables other than Y
• noise (or “soft”) interventions (Eberhardt & Scheines, 2007) at variables other than Y
in addition: there might be many other sets S* satisfying the Invariance Assumption,
but uniqueness is not really important (see later)
how do we know whether E affects the structural equation for Y?
if E does affect the structural equation for Y,
we will argue for the “robustness” of our procedure (proposed later):
⇝ no causal statements, no false positives
conservative, but on the safe side
Invariance Assumption: plausible to hold with real data
two-dimensional conditional distributions of observational (blue) and interventional (orange) data
(no intervention at the displayed variables X, Y)
[Figure: for one pair of variables, seemingly no invariance of the conditional distribution; for another, plausible invariance of the conditional distribution]
A procedure: population case
require and exploit the Invariance Assumption:
L(Y^e | X^e_{S*}) the same across e ∈ E
H_{0,γ,S}(E): γ_k = 0 if k ∉ S, and ∃ F_ε such that ∀ e ∈ E:
Y^e = X^e γ + ε^e,  ε^e ⊥ X^e_S,  ε^e ∼ F_ε the same for all e
if H_{0,γ,S}(E) is true:
• the model is correct
• S, γ are plausible causal variables/predictors and coefficients
and
H_{0,S}(E): there exists γ such that H_{0,γ,S}(E) holds
S is called a set of “plausible causal predictors” if H_{0,S}(E) holds
identifiable causal predictors under E:
defined as the set
S(E) = ∩ {S; H_{0,S}(E) holds},
the intersection of all sets of plausible causal predictors
under the Invariance Assumption we have, for any S*:
S(E) ⊆ S*
(since S* itself satisfies H_{0,S*}(E), it belongs to the family being intersected)
and this is key to obtaining confidence bounds for identifiable causal predictors
we have by definition:
S(E_1) ⊆ S(E_2) if E_1 ⊆ E_2
with
• more interventions
• more “heterogeneity”
• more “diversity in complex data”
we can identify more causal predictors:
identifiable causal predictors S(E) ↗ as E ↗
question: when is S(E) = S^0?
but it is not important that it is “equal to” or unique (see later)
Theorem (Peters, PB and Meinshausen, 2015)
S(E) = S^0 = (parental set of Y in the causal DAG)
if there is:
• a single do-intervention at each variable other than Y, and |E| = p
• a single noise intervention at each variable other than Y, and |E| = p
• a simultaneous noise intervention, and |E| = 2
the conditions can be relaxed such that it is not necessary to intervene at all the variables
Statistical confidence sets for causal predictors
“the finite-sample version of S(E) = ∩_S {S; H_{0,S}(E) is true}”
for “any” S ⊆ {1, ..., p}: test whether H_{0,S}(E) is accepted or rejected
Ŝ(E) = ∩ {S; H_{0,S} accepted at level α}
for H_{0,S}(E):
test constancy of the regression parameter and of its residual error distribution across e ∈ E,
by weakening H_{0,S}(E) to H̃_{0,S}(E):
∃ β, σ: β^e_pred(S) ≡ β, σ^e_pred(S) ≡ σ ∀ e ∈ E,
where
β^e_pred(S) = argmin_{β; β_k = 0 (k ∉ S)} E|Y^e − X^e β|^2,
σ^e_pred(S) = (E|Y^e − X^e β^e_pred(S)|^2)^{1/2}
note: H_{0,S}(E) true ⇒ H̃_{0,S}(E) true
testing H̃_{0,S}(E): assuming Gaussian errors,
D^T Σ_D^{-1} D / (σ̂^2 n_e) ∼ F(n_e, n_{−e} − |S| − 1),
where D = Y^e − Ŷ^e, with Ŷ^e based on the data without environment e, and
Σ_D = I_{n_e} + X_{e,S} (X_{−e,S}^T X_{−e,S})^{-1} X_{e,S}^T
reject H̃_{0,S}(E) if the p-value < α/|E|
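A hedged sketch of this test (with our own plug-in choices where the slide is silent, e.g. the variance estimator; no intercept is included, which corresponds to the “−1” degree of freedom on the slide):

```python
# Leave-one-environment-out invariance test for a candidate set S:
# fit on all data except environment e, compare prediction residuals
# on e via the F distribution given above.
import numpy as np
from scipy.stats import f

def invariance_pvalue(X_e, y_e, X_rest, y_rest):
    n_e, s = X_e.shape                         # X_e: columns restricted to S
    dof = X_rest.shape[0] - s                  # n_{-e} - |S| (no intercept)
    beta, *_ = np.linalg.lstsq(X_rest, y_rest, rcond=None)
    resid = y_rest - X_rest @ beta
    sigma2_hat = resid @ resid / dof           # error variance from data \ {e}
    D = y_e - X_e @ beta                       # prediction residuals on e
    G = np.linalg.inv(X_rest.T @ X_rest)
    Sigma_D = np.eye(n_e) + X_e @ G @ X_e.T    # covariance factor of D
    stat = D @ np.linalg.solve(Sigma_D, D) / (sigma2_hat * n_e)
    return f.sf(stat, n_e, dof)                # reject if < alpha / |E|
```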
Ŝ(E) = ∩_S {S; H_{0,S} accepted at level α}
for some significance level 0 < α < 1
going through all sets S?
1. start with S = ∅: if H_{0,∅}(E) is accepted ⇒ Ŝ(E) = ∅
2. consider small sets S of cardinality 1, 2, ..., and maintain the intersection S_∩ of the accepted sets considered so far:
for each S with H_{0,S} accepted: S_∩ ← S_∩ ∩ S
if the intersection S_∩ = ∅ ⇒ Ŝ(E) = ∅
if not: discard all S with S ⊇ S_∩ (they cannot shrink the intersection) and continue with the remaining sets
3. for large p: restrict the search space to variables selected by a Lasso regression; this needs a faithfulness assumption (and sparsity and assumptions on X^e for justification)
(see the schematic sketch below)
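A schematic of steps 1–2 (a simplified sketch; `accept(S)` is a hypothetical callable standing in for the level-α test of H_{0,S}(E), e.g. built from `invariance_pvalue` above):

```python
# Search over subsets, intersecting the accepted ones, with early stopping
from itertools import combinations

def estimate_S(p, accept, max_size=3):
    if accept(frozenset()):
        return set()                       # empty set already invariant
    S_cap = None                           # running intersection
    for k in range(1, max_size + 1):
        for S in map(frozenset, combinations(range(p), k)):
            if S_cap is not None and S >= S_cap:
                continue                   # supersets cannot shrink S_cap
            if accept(S):
                S_cap = set(S) if S_cap is None else S_cap & S
                if not S_cap:
                    return set()           # intersection empty: Ŝ(E) = ∅
    # nothing accepted: no invariant model found, return ∅ ("robustness")
    return S_cap if S_cap is not None else set()
```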
confidence sets with invariant prediction
1. for each S ⊆ {1, ..., p}, construct a set Γ̂_S(E) as follows:
if H̃_{0,S}(E) is rejected at level α/(2|E|): set Γ̂_S(E) = ∅
otherwise: Γ̂_S(E) = classical (1 − α/2) confidence interval for β_S based on X_S
2. set
Γ̂(E) = ∪_{S ⊆ {1,...,p}} Γ̂_S(E)   (estimated plausible causal coefficients)
we then obtain:
confidence set for S*: Ŝ(E)
confidence set for γ*: Γ̂(E)
Theorem (Peters, PB and Meinshausen, 2015)
assume: linear model, Gaussian errors;
then, for any γ*, S* satisfying the Invariance Assumption:
P[Ŝ(E) ⊆ S*] ≥ 1 − α : confidence w.r.t. the true causal variables
P[γ* ∈ Γ̂(E)] ≥ 1 − α : confidence set for the true causal parameter
and we can choose S* = S^0 = the causal variables in a linear SEM, if E does not affect the structural equation of Y
“on the safe side” (conservative):
we do not need to care about identifiability: if an effect is not identifiable, the method will not wrongly claim an effect
we do not require that S* is minimal and unique
“the first” result on statistical confidence for potentially non-identifiable causal predictors when the structure is unknown
(the route to confidence sets via graphical modeling seems awkward)
leading to (hopefully) more reliable causal inferential statements
“Robustness”
if the Invariance Assumption does not hold, because E has a direct effect on Y:
⇝ for all S: L(Y^e | X^e_S) is not invariant across e ∈ E
⇝ Ŝ(E) = ∅ (at least as n → ∞)
still on the safe side, but with no power
Empirical results: simulations
100 different scenarios, 1000 data sets per scenario:
|E| = 2, n_obs = n_interv ∈ {100, ..., 500}, p ∈ {5, ..., 40}
[Figure: success probability (power to detect causal predictors) against the familywise error rate P[Ŝ(E) ⊄ S*] (aimed at 0.05), comparing Invariant prediction, Lingam, Gies (known), Gies (unknown), Ges, Regression and Marginal]
Single gene deletion experiments in yeast
p = 6170 genes
response of interest: Y = expression of first gene
“covariates” X = gene expressions from all other genes
and then
response of interest: Y = expression of second gene
“covariates” X = gene expressions from all other genes
and so on
infer/predict the effects of a single gene knock-down on all
other genes
collaborators:
Frank Holstege, Patrick Kemmeren et al. (Utrecht)
data from modern technology
Kemmeren, ..., and Holstege (Cell, 2014)
Kemmeren et al. (2014):
genome-wide mRNA expressions in yeast: p = 6170 genes
• n_obs = 160 “observational” samples of wild-types
• n_int = 1479 “interventional” samples, each corresponding to a single gene deletion strain
for our method:
• we use |E| = 2 (observational and interventional data)
• training-test split:
  - training: all observational data and 2/3 of the interventional data
  - test: the other 1/3 of the gene deletion interventions
  - repeat this for the three blocks of interventional data
• since every interventional data point is used once as a response variable: we use coverage 1 − α/n_int with α = 0.01 and n_int = 1479
Results
8 genes are significant (at the α = 0.01 level) causal variables
(each of the 8 genes “causes” another gene)
validation with test data:

method        no. true pos. (out of 8)
invar.pred.   6
GIES          2
PC-IDA        2
marg.corr.    2
rand.guess.   *

*: quantiles for selecting true positives among 7 random draws: 2 (95%), 3 (99%)
⇝ our invariant prediction method has the most power!
and it should exhibit control against false positive selections
[Figure: number of strong intervention effects (0–8) against number of intervention predictions (0–25) for PERFECT, INVARIANT, HIDDEN-INVARIANT, PC, RFCI, REGRESSION (CV-Lasso), GES and GIES, and RANDOM (99% prediction interval)]
INVARIANT: invariant prediction method
HIDDEN-INVARIANT: invariant prediction with some hidden variables
Validation (Meinshausen, Hauser, Mooij, Peters, Versteeg & PB, 2015)
with intervention experiments: strong intervention effect (SIE)
with yeastgenome.org database: scores A-F
rank  cause      effect
1     YMR104C    YMR103C
2     YPL273W    YMR321C
3     YCL040W    YCL042W
4     YLL019C    YLL020C
5     YMR186W    YPL240C
6     YDR074W    YBR126C
7     YMR173W    YMR173W-A
8     YGR162W    YGR264C
9     YOR027W    YJL077C
10    YJL115W    YLR170C
11    YOR153W    YDR011W
12    YLR270W    YLR345W
13    YOR153W    YBL005W
14    YJL141C    YNR007C
15    YAL059W    YPL211W
16    YLR263W    YKL098W
17    YGR271C-A  YDR339C
18    YLL019C    YGR130C
19    YCL040W    YML100W
20    YMR310C    YOR224C
[the original table additionally marks, per pair, a strong intervention effect (SIE) and yeastgenome.org scores A–F with check marks]
SIE: correctly predicting a strong intervention effect which is in
the 1%- or 99% tail of the observational data
Flow cytometry data (Sachs et al., 2005)
• p = 11 abundances of chemical reagents
• 8 different environments (not “well-defined” interventions)
(one of them observational; 7 with different reagents added)
• each environment contains n_e ≈ 700–10,000 samples
goal: recover the network of causal relations (linear SEM)
[Figure: network over the 11 variables Raf, Mek, Erk, Akt, PKA, PKC, PLCg, PIP2, PIP3, JNK, p38]
approach: invariant causal prediction
(one variable is the response Y, the other 10 the covariates X; do this 11 times, with every variable once the response)
main concern: different environments might have a direct influence on the response ⇝ the Invariance Assumption would fail
instead of requiring invariance among all 8 environments:
⇝ require invariance for pairs of environments E_ij (28 pairs in total), and do a Bonferroni correction by taking the union
S̃(E) = ∪_{i<j} Ŝ_{α/28}(E_ij)
this weakens the assumption that an intervention should not directly influence the response Y;
and if a pair of interventions does nevertheless ⇝ Ŝ = ∅ (“robustness” as discussed before)
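A schematic of this pairwise construction (function names hypothetical; `icp_fit` stands in for the invariant-causal-prediction estimate on two environments at a given level):

```python
# Union of pairwise ICP estimates with Bonferroni-corrected level
from itertools import combinations

def pairwise_ICP(environments, icp_fit, alpha=0.05):
    pairs = list(combinations(range(len(environments)), 2))
    S_tilde = set()                          # S~(E) = union over all pairs
    for i, j in pairs:
        S_tilde |= icp_fit(environments[i], environments[j],
                           alpha / len(pairs))
    return S_tilde
```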
[Figure: estimated network over the 11 variables]
blue edges: only the invariant causal prediction approach (ICP)
red edges: only ICP allowing hidden variables and feedback
purple edges: both ICP with and without hidden variables
solid: relations that have been reported in the literature
broken: new findings not reported in the literature
⇝ reasonable consensus with existing results,
but no real ground truth available
serves as an illustration that we can work with “vaguely defined interventions”
Concluding thoughts
generalize the Invariance Assumption and the statistical testing to nonparametric/nonlinear models, in particular additive models:
∀ e ∈ E: Y^e = f*(X^e_{S*}) + ε^e,  ε^e ∼ F_ε,  ε^e ⊥ X^e_{S*}
∀ e ∈ E: Y^e = Σ_{j ∈ S*} f*_j(X^e_j) + ε^e,  ε^e ∼ F_ε,  ε^e ⊥ X^e_{S*}
the statistical significance testing becomes more difficult;
improved identifiability with nonlinear SEMs (Mooij et al., 2009)
generalize to include hidden variables:
S^0 = pa(Y) ∩ observed variables, S^0_H = pa(Y) ∩ hidden variables
the presented procedure is still OK if (essentially):
- no interventions at Y, at S^0_H, or at ancestors(S^0_H)
- no hidden confounder between Y and S^0
[Figure: example DAG with hidden variables H1, H2, H3, observed variables X1, X8, X12 (with S^0 marked) and response Y]
more general hidden variable models can be treated with a “related” technique
(Rothenhäusler, Heinze, Peters & Meinshausen, 2015)
[Figure: DAG with intervention variable I, hidden variable W, causes C, effects E and response Y]
(X, Y) form a DAG with X = (C, E), split into causes C and effects E without knowing C, E:
C ← B_{c←i} I + B_{c←w} W + ε_C
Y ← B_{y←w} W + B_{y←c} C + ε_Y
E ← B_{e←i} I + B_{e←w} W + B_{e←c} C + B_{e←y} Y + ε_E
provocative next step: how about using “Big Data”?
⇝ structure the “large-scale” data into different unknown groups of experimental settings E
that is: learn E from data
“optimal” E is a trade-off between
identifiability and statistical power
learning/estimating E
problem: given a “large bag of data”, can we estimate the unknown underlying different experimental conditions?
mathematically: denote by J all experimental settings satisfying the Invariance Assumption
⇝ mixture distribution for a constructed setting e ∈ E: F^e = Σ_{j ∈ J} w^e_j F^j
when pooling two constructed experimental settings e_1 and e_2:
⇝ a new mixture distribution with weights (w^{e_1} + w^{e_2})/2
- mixture modeling
- change-point modeling for (time-)ordered data
might be useful to estimate E
causal components remain the same for different sub-populations or experimental settings
⇝ exploit the power of heterogeneity in complex data!
and confidence bounds follow naturally
Thank you!
Software
R-package: pcalg
(Kalisch, Mächler, Colombo, Maathuis & PB, 2010–2015)
R-package: InvariantCausalPrediction (Meinshausen, 2014)
References to some of our own work:
• Peters, J., Bühlmann, P. and Meinshausen, N. (2015). Causal inference using invariant prediction: identification and confidence intervals. To appear in Journal of the Royal Statistical Society, Series B (with discussion). Preprint arXiv:1501.01332.
• Meinshausen, N., Hauser, A., Mooij, J., Peters, J., Versteeg, P. and Bühlmann, P. (2015). Causal inference from gene perturbation experiments: methods, software and validation. Preprint.
• Hauser, A. and Bühlmann, P. (2015). Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs. Journal of the Royal Statistical Society, Series B, 77, 291-318.
• Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13, 2409-2464.
• Kalisch, M., Mächler, M., Colombo, D., Maathuis, M.H. and Bühlmann, P. (2012). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software 47 (11), 1-26.
• Stekhoven, D.J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M.H. and Bühlmann, P. (2012). Causal stability ranking. Bioinformatics 28, 2819-2823.
• Maathuis, M.H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature Methods 7, 247-248.
• Maathuis, M.H., Kalisch, M. and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Annals of Statistics 37, 3133-3164.