Causal Inference & Genetic Regulatory Networks
Peter Spirtes
Carnegie Mellon University
With slides from Lizzie Silver
Outline
 Biology
 Data and Background Knowledge
 Problems
 Algorithms
Causal Graph
[Figure: regulatory network diagram with nodes labeled gene, mRNA, and protein]
Source: http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/images/REGNET.jpg
Protein States
 Folding
 Location (nucleus, membrane, etc.)
 Phosphorylation at different sites
 Ubiquitination
 etc.
Levels of Description
 Differential equation models of rates of reaction
 Which transcription factors bind to which sites, and can
in turn be prevented or aided in binding to sites
 Which gene products affect the rate at which other
genes produce proteins
Example Graph
Outline
 Biology
 Data and Background Knowledge
 Problems
 Algorithms
Knockout Experiments
 insert DNA construct into cell
 DNA construct recombines with target gene
 target gene then either yields no protein at all or
yields a nonfunctional protein
Data: Hughes lab yeast data
 Hughes lab data on S. cerevisiae:
 63 wild-type strains
 267 gene-deletion mutants
 No information on direct effects, only total effects
 But does include information on non-effects
 Must normalize the data to account for differential
manipulation - what is the right normalization method?
Data: M3D
 Many Microbes Microarrays Database (M3D)
 Collected from multiple sources of published data
 Different labs, strains, experimental conditions, genetic manipulations
 E. coli (907), S. oneidensis (245), S. cerevisiae (904)
 Experiment descriptions/features standardized
 Affymetrix arrays uniformly normalized using Robust Multi-array Average (RMA); raw data also available
Data: RegulonDB
 RegulonDB: The E. coli regulatory network database
 Curated database of expert knowledge, constantly updated
 List of known TF → gene effects, annotated with valence (+;-;±),
type(s) of evidence supporting, publications supporting
 No information on size of effects
 Some information about direct effects, from Chromatin ImmunoPrecipitation (ChIP) assays, recognized binding sites, etc.
 If an effect is not in RegulonDB, that doesn't mean it doesn't exist!
Outline
 Biology
 Data and Background Knowledge
 Problems
 Algorithms
Search for Genetic Regulatory Networks
 Goal: Search for a Directed Acyclic Graph (DAG) representing the “direct” regulatory effects (relative to the set of genes).
 This is a hard problem!
 # of DAGs is super-exponential in # of genes (4,300 genes in E. coli)
 Unobserved confounders
   Environmental conditions
   Excluded genes
   Latent TFs that influence multiple genes
 Cycles
 Unfaithfulness
 Non-linearity, non-Gaussianity
 Density not strictly positive
 Small sample size
 Aggregation
 How can we evaluate performance?
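To make the "super-exponential" claim concrete, the sketch below counts labeled DAGs with Robinson's recurrence. The function name `count_dags` is chosen here for illustration, and the recursive form is only meant for small numbers of nodes.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    total = 0
    for k in range(1, n + 1):
        # Inclusion-exclusion over the k nodes that have no incoming edges.
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
    return total


# Print the count for a handful of small graphs.
for n_nodes in range(1, 11):
    print(n_nodes, count_dags(n_nodes))
```

Already at ten nodes the count has 19 digits, so exhaustive search over DAGs is out of reach long before the number of genes gets anywhere near a genome.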
Li and Biggin
Feedback
 The equilibrium state of a feedback system can be
represented by a cyclic graph.
 If the joint distribution is Gaussian or multinomial, then
the natural extension of d-separation to cyclic graphs
entails the corresponding conditional independence.
 If the joint distribution is not Gaussian or multinomial,
then the natural extension of d-separation to cyclic
graphs does not entail the corresponding conditional
independence.
Feedback
 There is an extended sense of “pattern” to represent the set
of all Markov equivalent graphs, cyclic or acyclic.
 However, it is much more complicated than a pattern – not
all Markov equivalent cyclic graphs share the same set of
adjacencies, and there are dependencies among which
edges.
 There is an algorithm that is an extension of the PC
algorithm for searching for cyclic graphs.
 We do not have a graphical representation of the set of all
Markov equivalent graphs with cycles and latent variables.
Local Markov Theorem (Chu)
 Let G be an acyclic graph representing the causal relations among a set V of random variables. Let $Y, X_1, \dots, X_k \in V$, and let $X = \{X_1, \dots, X_k\}$ be the set of parents of Y in G. If $Y = c^{T} X + \varepsilon$, where $c^{T} = (c_1, \dots, c_k)$ and $\varepsilon$ is a noise term independent of all non-descendants of Y, then Y is independent of all its non-parent non-descendants conditional on its parents X, and this relation holds under aggregation.
Outline
 Biology
 Data and Background Knowledge
 Problems
 Algorithms
Direct versus Indirect Effects
 What is a manipulation?
 Setting the value of a variable means you break the influence of whatever normally influences it, e.g. setting arcA := 0
 Direct v. total effects:
 Total effects: If I manipulate fnr and let everything else vary as usual, what happens to sodA?
 Direct effects: If I manipulate fnr and also clamp arcA to its current value, what happens to sodA?
 (What if I don't know whether to clamp narK?)
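The distinction between total and direct effects can be sketched with a small linear structural model. The graph (fnr → arcA, arcA → sodA, fnr → sodA) and all coefficients below are assumptions made up for illustration; they are not estimates from any data set, and noise terms are omitted because only expected values are compared.

```python
# Hypothetical linear structural model (coefficients invented for illustration):
#   fnr -> arcA, arcA -> sodA, fnr -> sodA
B_FNR_ARCA = 0.5
B_ARCA_SODA = -2.0
B_FNR_SODA = 1.5


def expected_soda(fnr, arca_clamp=None):
    """Expected sodA level when fnr is set by intervention.

    If arca_clamp is None, arcA responds to fnr as usual;
    otherwise arcA is held at the clamped value.
    """
    arca = B_FNR_ARCA * fnr if arca_clamp is None else arca_clamp
    return B_FNR_SODA * fnr + B_ARCA_SODA * arca


# Total effect: move fnr from 0 to 1 and let arcA follow.
total_effect = expected_soda(1.0) - expected_soda(0.0)

# Direct effect: move fnr from 0 to 1 while clamping arcA at its value under fnr = 0.
baseline_arca = B_FNR_ARCA * 0.0
direct_effect = (expected_soda(1.0, arca_clamp=baseline_arca)
                 - expected_soda(0.0, arca_clamp=baseline_arca))

print("total effect of fnr on sodA:", total_effect)
print("direct effect of fnr on sodA:", direct_effect)
```

In this toy model the total effect equals the direct coefficient plus the product of the coefficients along the path through arcA, so the two quantities can differ in size and even in sign.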
Gold Standard: Direct v. Total Effects
 Direct effects:
   For known true graph: evaluate using graph similarity metrics
     True positive and false positive rate for adjacencies and orientations
     Structural Hamming distance
   For unknown true graph:
     Experimental control of all potential back-door paths
     Mechanistic approach: protein binding arrays
 Total effects:
   For known true graph:
     Path from cause to effect in both graphs?
     (Path length? Size of path coefficients?)
     Structural intervention distance
   For unknown true graph:
     Truth: Gene knock-out experiments estimate true total causal effect
     Prediction: IDA
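As a minimal sketch of one of the graph similarity metrics listed above, the function below computes a structural Hamming distance between two directed graphs. It uses one common convention (a missing, extra, or reversed edge each counts as one error); other definitions compare patterns rather than DAGs. The example graphs over fnr, arcA, and sodA are hypothetical.

```python
from itertools import combinations


def structural_hamming_distance(nodes, true_edges, est_edges):
    """Count node pairs whose edge status (absent, a->b, b->a) differs."""
    true_set, est_set = set(true_edges), set(est_edges)
    distance = 0
    for a, b in combinations(nodes, 2):
        true_status = ((a, b) in true_set, (b, a) in true_set)
        est_status = ((a, b) in est_set, (b, a) in est_set)
        if true_status != est_status:
            distance += 1
    return distance


nodes = ["fnr", "arcA", "sodA"]
# Hypothetical "true" graph and a hypothetical estimate of it.
true_graph = [("fnr", "arcA"), ("arcA", "sodA"), ("fnr", "sodA")]
estimate = [("fnr", "arcA"), ("sodA", "arcA")]

shd = structural_hamming_distance(nodes, true_graph, estimate)
print("SHD:", shd)
```

Here the estimate misses fnr → sodA and reverses arcA → sodA, so two node pairs disagree.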
Patterns
 Markov Equivalence Class (MEC) contains
all DAGs consistent with set of conditional
independences
 All DAGs in the MEC share the same
adjacencies
 All share the same “unshielded colliders”
 Can be represented with a “pattern”:
 common “skeleton” of adjacencies
 edges directed when all the DAGs within the
MEC agree on their orientation
 edges left undirected otherwise
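The two graphical criteria above (same adjacencies, same unshielded colliders) can be checked directly. The sketch below assumes both inputs are DAGs over the same nodes, given as lists of (parent, child) pairs; the function names are chosen here for illustration.

```python
from itertools import combinations


def skeleton(edges):
    """Set of adjacencies, ignoring edge direction."""
    return {frozenset(edge) for edge in edges}


def unshielded_colliders(edges):
    """Triples a -> c <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    colliders = set()
    for child, pars in parents.items():
        for a, b in combinations(sorted(pars), 2):
            if frozenset((a, b)) not in skel:
                colliders.add((frozenset((a, b)), child))
    return colliders


def markov_equivalent(edges_1, edges_2):
    """True if two DAGs share adjacencies and unshielded colliders."""
    return (skeleton(edges_1) == skeleton(edges_2)
            and unshielded_colliders(edges_1) == unshielded_colliders(edges_2))


chain = [("fnr", "arcA"), ("arcA", "sodA")]
reversed_chain = [("sodA", "arcA"), ("arcA", "fnr")]
collider = [("fnr", "arcA"), ("sodA", "arcA")]

print(markov_equivalent(chain, reversed_chain))  # same MEC
print(markov_equivalent(chain, collider))        # different MEC
```

The chain and its reversal land in the same equivalence class, so their shared edges would be left undirected in the pattern, while the collider is distinguishable from both.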
Intervention Calculus When the DAG is Absent (IDA)
• Run PC.
• Calculate effect q of manipulating arcA on sodA
in each graph.
• Count how many times each value q occurs.
[Figure: the graphs over fnr, arcA, and sodA considered by IDA, each annotated with its effect of arcA on sodA (0, 0, 0, q1, q2, q3); the resulting multiset of effects is {0,0,0,q1,q2,q3}]
IDA
• Too expensive if many variables.
• By considering only the local structure around sodA and
arcA, one can calculate all possible effects, but not how
many times they occur.
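The local idea can be sketched as follows: enumerate which undirected neighbors of arcA could be its parents, keep only choices that create no new unshielded collider at arcA, and for each choice take the coefficient of arcA when regressing sodA on arcA and those parents. This is a simplified illustration, not the pcalg implementation. The covariance matrix comes from a made-up linear model, the pattern is assumed to be a fully undirected triangle, and all function names are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x


def effect(cov, x, y, parents):
    """Coefficient of x when regressing y on x and the assumed parents of x."""
    if y in parents:
        return 0.0  # y is a cause of x in this graph, so x cannot affect y
    z = [x] + sorted(parents)
    a = [[cov[i][j] for j in z] for i in z]
    b = [cov[i][y] for i in z]
    return solve(a, b)[0]


def local_effects(cov, x, y, definite_parents, siblings, adjacent):
    """Effects of x on y over locally valid parent sets of x."""
    effects = []
    for k in range(len(siblings) + 1):
        for extra in combinations(sorted(siblings), k):
            parents = set(definite_parents) | set(extra)
            # Locally valid: every newly oriented parent is adjacent to
            # every other parent, so no new unshielded collider at x.
            valid = all(frozenset((s, p)) in adjacent
                        for s in extra for p in parents if p != s)
            if valid:
                effects.append(effect(cov, x, y, parents))
    return effects


# Covariances implied by an invented model:
# arcA = 0.5 fnr + e, sodA = arcA + 0.5 fnr + e', unit noise variances.
cov = {
    "fnr": {"fnr": 1.0, "arcA": 0.5, "sodA": 1.0},
    "arcA": {"fnr": 0.5, "arcA": 1.25, "sodA": 1.5},
    "sodA": {"fnr": 1.0, "arcA": 1.5, "sodA": 3.0},
}
adjacent = {frozenset(p) for p in combinations(["fnr", "arcA", "sodA"], 2)}
effects = local_effects(cov, "arcA", "sodA", set(), {"fnr", "sodA"}, adjacent)
possible = sorted(set(round(e, 9) for e in effects))
print(possible)
```

Only the distinct values in `possible` are meaningful here: the local enumeration does not say how many DAGs in the equivalence class produce each value.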
[Figure: local structures around arcA and sodA (with fnr); the resulting set of possible effects is {0,q1,q2,q3}]
IDA application
 p = 5360 genes (expression of genes)
 231 gene knock-downs, giving 1.2 × 10^6 intervention effects
 the truth is “known in good approximation” (thanks to intervention experiments)
 goal: prediction of the true large intervention effects based on observational data with no knock-downs
 n = 63 observational data
IDA
Maathuis
CStaR
 Runs IDA multiple times in order to choose the genes
that are most stably among the ones selected as the
strongest.
 Sample 50% of the original data set (with replacement)
one hundred times.
 For each subsample, run the IDA algorithm.
 Take the output of the IDA algorithm for that subsample
and order the variables by the size of the estimated
(by IDA) lower bound of the total effect of the variable
on the target.
Stability Selection Step
 Record the frequency with which a given variable
appears in the top q of the total effect sizes, for a user
selected value q.
 Select the variables that appear with the highest
frequency (that is, those judged most often to have large
lower bounds of total effects on the target)
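The resampling and frequency-counting steps can be sketched as below. The scoring function is a stand-in: CStaR would use the IDA lower bound of the total effect, whereas this toy example uses the absolute sample covariance with the target so that the block stays self-contained. The data, function names, and the 0.6 cutoff are all assumptions for illustration; sampling is done with replacement, as stated above.

```python
import random
from collections import Counter


def abs_cov_scores(rows, target="y"):
    """Stand-in scorer: |sample covariance| of each variable with the target."""
    n = len(rows)
    names = [key for key in rows[0] if key != target]
    mean_y = sum(r[target] for r in rows) / n
    scores = {}
    for name in names:
        mean_x = sum(r[name] for r in rows) / n
        cov = sum((r[name] - mean_x) * (r[target] - mean_y) for r in rows) / n
        scores[name] = abs(cov)
    return scores


def selection_frequencies(rows, score_fn, q, n_rounds=100, frac=0.5, seed=0):
    """Fraction of resampling rounds in which each variable is in the top q."""
    rng = random.Random(seed)
    size = max(1, int(frac * len(rows)))
    counts = Counter()
    names = set()
    for _ in range(n_rounds):
        sample = rng.choices(rows, k=size)  # drawn with replacement
        scores = score_fn(sample)
        names.update(scores)
        top_q = sorted(scores, key=scores.get, reverse=True)[:q]
        counts.update(top_q)
    return {name: counts[name] / n_rounds for name in sorted(names)}


# Toy data: only g1 actually influences y.
data_rng = random.Random(1)
rows = []
for _ in range(60):
    g1, g2, g3 = (data_rng.gauss(0, 1) for _ in range(3))
    rows.append({"g1": g1, "g2": g2, "g3": g3,
                 "y": 3.0 * g1 + data_rng.gauss(0, 0.5)})

freqs = selection_frequencies(rows, abs_cov_scores, q=1)
stable = [gene for gene, freq in freqs.items() if freq > 0.6]
print(freqs)
print("stably selected:", stable)
```

Passing the scorer in as a function keeps the resampling loop independent of how the effect sizes are estimated, so an IDA-based scorer could be swapped in without changing the counting logic.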
Stability Selection
 Order the variables by selection frequency: $\Pi_1 > \Pi_2 > \dots > \Pi_p$ (from most often to least often selected).
 Define the stably selected genes (covariates) as $\hat{S}_{\text{stable}} = \{ j : \Pi_j > \pi_{\text{thr}} \}$ for some threshold $0.5 < \pi_{\text{thr}} \le 1$.
 Denote the number of wrongly selected genes (false positives) by $V = |\hat{S}_{\text{stable}} \cap S_{\text{false}}|$, where $S_{\text{false}}$ is the set of genes (covariates) whose true lower bound $\beta_j$ is 0.
Stability Selection
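 For a given threshold $\pi_{\text{thr}}$ and a given value of $q$,
$$E[V] \le \frac{1}{2\pi_{\text{thr}} - 1} \cdot \frac{q^2}{p}$$
 If $\pi_{\text{thr}} = \hat{\Pi}_j$, then
$$\frac{E[V]}{p} \le \frac{1}{2\hat{\Pi}_j - 1} \cdot \frac{q^2}{p^2}$$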
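As a worked example of the bound on the expected number of false positives, E[V] ≤ q² / ((2π_thr − 1) p), the sketch below evaluates it numerically. The value p = 5360 is the number of genes quoted in the IDA application; q = 100 and π_thr = 0.75 are arbitrary choices made here for illustration.

```python
def false_positive_bound(q, p, pi_thr):
    """Upper bound on E[V] for top-q selection among p variables."""
    if not 0.5 < pi_thr <= 1:
        raise ValueError("pi_thr must lie in (0.5, 1]")
    return q ** 2 / ((2 * pi_thr - 1) * p)


p = 5360       # number of genes
q = 100        # hypothetical size of the top list
pi_thr = 0.75  # hypothetical frequency threshold

bound = false_positive_bound(q, p, pi_thr)
print("bound on E[V]:", bound)
print("bound on per-comparison error rate E[V]/p:", bound / p)
```

Raising the threshold toward 1 or shrinking q tightens the bound, which is the trade-off behind choosing q and π_thr.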
Stability Selection
 CStaR is relatively insensitive to the choice of the
range of qs.
 Down to a certain lower bound, small values of q lead
to higher sensitivity.
 For q-values below the lower bound, the ranking
becomes unstable again.
Stability Selection
 All genes are ranked according to the median rank with
respect to the different q-values.
 Ties in the final ranking are sorted according to median
total causal effect size.
CStaR Results
Choosing Experiments: Stekhoven et al.
 Mouse-ear cress; response Y: days to bolting (flowering) of the plant
 Covariates X: gene-expression profile
 Observational data with n = 47 and p = 21,326
Experimental Confirmation: Stekhoven et al.
 PC + IDA + stability selection
 Performed experiments on 14 of the top 20 genes (those not previously known and with an easily available mutant)
Results: Arabidopsis thaliana
 9 among the 14 mutants
survived
 4 among the 9 mutants
(genes) showed a significant
effect for Y relative to the
wildtype (non-mutated plant)
Chen and Storey
 “For an individual organism, DNA has the useful feature
that it is usually a static variable, meaning that it is fixed
and will not change with changing RNA levels, protein
levels, phenotypes, or environmental conditions. By
performing designed crosses of genetically distinct
inbred or isogenic lines, one can randomize the
genotypes of an organism from two or more genetic
backgrounds, thereby producing independent
realizations of DNA content from offspring to offspring.”
Chen and Storey
 Identify causal relations of the form L → Ti → Tj where L is
known to be exogenous and prior to Ti and Tj
 L is the genotype at a fixed locus, generated through
crossing two haploid parental strains to produce 112
recombinant haploid segregant strains
 Ti and Tj are expression levels of genes
 Given this background knowledge, we just need to
determine that L and Ti are dependent, Ti and Tj are
dependent, and L is independent of Tj given Ti.
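For a chain L → Ti → Tj, the genotype L and the downstream transcript Tj are dependent marginally but independent once Ti is conditioned on, whereas Ti and Tj stay dependent given L. The simulation below is a minimal sketch of that pattern using partial correlations. The coefficients, noise levels, and sample size (5000, far larger than the 112 segregants in the study, to keep the illustration stable) are all assumptions.

```python
import random
from math import sqrt


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def partial_corr(a, b, c):
    """Partial correlation of a and b given c."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


rng = random.Random(0)
n = 5000
genotype = [rng.choice((0, 1)) for _ in range(n)]    # L: randomized by the cross
t_i = [g + rng.gauss(0, 0.5) for g in genotype]      # Ti depends on L
t_j = [t + rng.gauss(0, 0.5) for t in t_i]           # Tj depends on Ti only

print("corr(L, Ti)      :", round(corr(genotype, t_i), 3))
print("corr(Ti, Tj)     :", round(corr(t_i, t_j), 3))
print("pcorr(L, Tj | Ti):", round(partial_corr(genotype, t_j, t_i), 3))
print("pcorr(Ti, Tj | L):", round(partial_corr(t_i, t_j, genotype), 3))
```

Partial correlation is only a linear measure of conditional dependence; it is used here because the simulated model is linear, not because the original study is known to use it.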
Faith et al
 Used Many Microbes Microarrays Database (M3D)
 Evaluated using RegulonDB
 Restricted search space: only allowed edges out of
genes coding for TFs
 Compared several search algorithms (but not a fair
comparison for Bayes Net learning algorithm)
References

LS Chen, F Emmert-Streib, and JD Storey. Harnessing naturally randomized transcription to
infer regulatory relationships among genes. Genome Biology, 2007.

T. Chu, C. Glymour, R. Scheines, P. Spirtes. A statistical problem for inference to regulatory structure
from associations of gene expression measurement with microarrays. Bioinformatics 2003;19:1147-52.
PMID: 12801876.

Jeremiah J. Faith, Boris Hayete, Joshua T. Thaden, Ilaria Mogno, Jamey Wierzbowski, Guillaume
Cottarel, Simon Kasif, James J. Collins, and Timothy S. Gardner. Large-scale mapping and validation
of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLOS Biology,
5(1):0054–0066, 2007.

J Li, M Biggin, Statistics requantitates the central dogma, Science, 347(6226), 1066-1067, 2015.

Marloes H Maathuis, Diego Colombo, Markus Kalisch, and Peter Bühlmann. Predicting causal effects
in large-scale systems from observational data. Nature Methods, 7(4):247–248, 2010.

K. Sachs, O. Perez, D. Pe’er, D.A. Lauffenburger, and G.P. Nolan. Causal protein-signaling networks
derived from multiparameter single-cell data. Science 308, 523–529, (2005).

Daniel J. Stekhoven, Izabel Moraes, Gardar Sveinbjornsson, Lars Hennig, Marloes H. Maathuis, and
Peter Bühlmann. Causal stability ranking. Bioinformatics, 28(21):2819–2823, 2012.