# ppt slides ```An Introduction to Causal Modeling and
Discovery Using Graphical Models
Greg Cooper
University of Pittsburgh
Overview





Introduction
Representation
Inference
Learning
Evaluation
What Is Causality?



Much consideration in philosophy
I will treat it as a primitive
Roughly, if we manipulate something and
something else changes, then the former causally
influences the latter.
Why is Causation Important?


Causal issues arise in most fields including
medicine, business, law, economics, and the
sciences
An intelligent agent is continually considering
what to do next in order to change the world
(including the agent’s own mind). That is a
causal question.
Representing Causation Using
Causal Bayesian Networks



A causal Bayesian network (CBN) represents
some entity (e.g., a patient) that we want to
model causally
Features of the entity are represented by
variables/nodes in the CBN
Direct causation is represented by arcs
An Example
of a Causal Bayesian Network
Structure
History of Smoking (HS)
Chronic Bronchitis (CB)
Lung Cancer (LC)
Fatigue (F)
Weight Loss (WL)
An Example of the Accompanying
Causal Bayesian Network
Parameters
P(HS = no) = 0.80
P(HS = yes) = 0.20
P(CB = absent | HS = no) = 0.95
P(CB = absent | HS = yes) = 0.75
P(CB = present | HS = no) = 0.05
P(CB = present | HS = yes) = 0.25
P(LC = absent | HS = no) = 0.99995
P(LC = absent | HS = yes) = 0.997
P(LC = present | HS = no) = 0.00005
P(LC = present | HS = yes) = 0.003



Causal Markov Condition




A node is independent of its non-effects given just its
direct causes.
This is the key representational property of
causal Bayesian networks.
Special case: A node is independent of its
distant causes given just its direct causes.
General notion: Causality is local
Causal Modeling Framework


An underlying process generates entities that
share the same causal network structure. The
entities may have different parameters
(probabilities).
Each entity independently samples the joint
distribution defined by its CBN to generate
values (data) for each variable in the CBN model
Entity Generator
HS1
HS2
HS3
LC1
LC2
LC3
WL1
WL2
WL3

(no, absent, absent)
(yes, present, present) (yes, absent, absent)
(no, absent, absent)
(yes, absent, absent)
existing
entities
entity
feature
values
samples
Discovering the Average Causal
Bayesian Network
HSavg
LCavg
WLavg
Some Key Types of Causal Relationships
Direct
Causation
HS
Indirect
Causation
Confounding
HS
F
HS
CB
LC
Sampling bias
WL
LC
Sampled = true
WL
Inference Using a Single CBN
When Given Evidence
in the Form of Observations
P(F | CB = present, WL = present, CBN1)
History of Smoking (HS)
Chronic Bronchitis (CB)
Lung Cancer (LC)
Fatigue (F)
Weight Loss (WL)
Inference

The Markov Condition implies the following
equation:
n
P( X 1 , X 2 ,..., X n ) 
 P( X i | DirectCauses( X i ))
i 1


The above equation specifies the full joint
probability distribution over the model variables.
From the joint distribution we can derive any
conditional probability of interest.
Inference Algorithms




In the worst case, the brute force algorithm is
exponential time in the number of variables in the
model
Numerous exact inference algorithms have been
developed that exploit independences among the
variables in the causal Bayesian network.
However, in the worst case, these algorithms are
exponential time.
Inference in causal Bayesian networks is NP-hard
(Cooper, AIJ, 1990).
Inference Using a Single CBN
When Given Evidence
in the Form of Manipulations
P(F | MCB = present, CBN1)
• Let MCB be a new variable that can have the same
values as CB (present, absent) plus the value observe.
• Add an arc from MCB to CB.
• Define the probability distribution of CB given its
parents.
Inference Using a Single CBN
When Given Evidence
in the Form of Manipulations
P(F | MCB = present, CBN1)
MCB
History of Smoking (HS)
Chronic Bronchitis (CB)
Lung Cancer (LC)
Fatigue (F)
Weight Loss (WL)
A Deterministic Manipulation
P(F | MCB = present), CBN1)
MCB
History of Smoking (HS)
Chronic Bronchitis (CB)
Lung Cancer (LC)
Fatigue (F)
Weight Loss (WL)
Inference Using a Single CBN
When Given Evidence
in the Form of Observations and Manipulations
P(F | MCB = present, WL = present, CBN1)
MCB
History of Smoking (HS)
Chronic Bronchitis (CB)
Lung Cancer (LC)
Fatigue (F)
Weight Loss (WL)
Inference Using Multiple CBNs:
Model Averaging
P( X | manipulate(Y' ), observe( Z' )) 
 P( X | manipulate(Y' ), Z' , CBN i )  P(CBN i | data, K )
i
Some Key Reasons
for Learning CBNs

Scientific discovery among measured variables
Example of general: What are the causal
relationships among HS, LC, CB, F, and WL?
 Example of focused: What are the causes of LC
from among HS, CB, F, and WL?



Scientific discovery of hidden processes
Prediction

Example: The effect of not smoking on contracting
lung cancer
Major Methods for Learning CBNs
from Data

Constraint-based methods
Uses tests of independence to find patterns of
relationships among variables that support causal
relationships
 Relatively efficient in discovery of causal models
with hidden variables
 See talk by Frederick Eberhardt this morning


Score-based methods Bayesian scoring


Allows informative prior probabilities of causal structure
and parameters
Non-Bayesian scoring

Does not allow informative prior probabilities
Learning CBNs from Observational Data:
A Bayesian Formulation
P( X  Y | D, K ) 
 P(S i | D, K )
i: Si containsX Y
where D is observational data,
Si is the structure of CBNi,
and K is background knowledge and belief.
Learning CBNs from Observational Data
When There Are No Hidden Variables
P( S i | D, K ) 

P( S i | K ) P( D | S i , i , K ) P( i | S i , K ) d i
 P(S j | K )  P(D | S j , j , K ) P( j | S j , K ) d j
j
where i are the parameters associated with Si and the
sum is over all CBNs for which P(Sj | K) &gt; 0.
The BD Marginal Likelihood

The previous integral has the following closed
form solution, when we assume Dirichlet priors
(ijk and ij), multinomial likelihoods (Nijk and
Nij denote counts), parameter independence, and
parameter modularity:
( ijk  N ijk )
 (  N )  ( )
i 1 j 1
ij
ij k 1
ijk
n
qi
( ij )
ri
Searching for Network Structures





Greedy search often used
Hybrid methods have been explored that constraints
and scoring
Some algorithms guarantee locating the generating
model in the large sample limit (assuming Markov and
Faithfulness conditions), as for example the GES
algorithm (Chickering, JMLR, 2002)
The ability to approximate the generating network is
often quite good
An excellent discussion and evaluation of several stateof-the-art methods, including a relatively new method
(Max-Min Hill Climbing) is at:
Tsamardinos, Brown, Aliferis, Machine Learning, 2006.
The Complexity of Search

Given a complete dataset and no hidden
variables, locating the Bayesian network
structure that has the highest posterior
probability is NP-hard (Chickering, AIS, 1996;
Chickering, et al, JMLR, 2004).
Experimental Data Together
than from Either One Alone
H
C
E
We cannot learn the above causal structure
from observational or experimental data
alone. We need both.
Learning CBNs from Observational Data
When There Are Hidden Variables
P( S i | K )
P ( S i | D, K ) 
  P(D | S i , i , K ) P( i | S i , K ) d i
Hi
 P(S j | K )  P( D | S j , j , K ) P( j | S j , K ) d j
j
Hj
where Hi (Hj) are the hidden variables in Si (Sj) and the
sum in the numerator (denominator) is taken over all
values of Hi (Hj).
Learning CBNs from Observational and
Experimental Data:
A Bayesian Formulation
• For each model variable Xi that is experimentally manipulated in at least one
case, introduce a potential parent MXi of Xi .
• Xi can have parents as well from among the other {X1, ..., Xi-1, Xi+1, ..., Xn}
domain variables in the model.
• Priors on the distribution of Xi will include conditioning on MXi,when it is a
parent of Xi, as well as conditioning on the other parents of Xi.
• Define MXi to have the same values vi1, vi2, ... , viq as Xi, plus a value o (for
observe).
o When MXi has value vij in a given case, this represents that the
experimenter intended to manipulate Xi to have value vij in the case.
o When MXi has value observe in a given case, this represents that no
attempt was made by the experimenter to manipulate Xi, but
rather, Xi was merely observed to have the value recorded for it.
• With the above variable additions in place, use the previous Bayesian
methods for causal modeling from observational data.
An Example Database Containing
Observations and Manipulations
HS
MCB
CB LC
F
WL
T
obs
T
F
T
T
F
F
F
T
T
F
F
F
T
T
F
F
F
T
F



T
obs
F
Faithfulness Condition

Faithfulness Condition
Any independence among variables in the data
generating distribution follows from the Markov
Condition applied to the data generating causal
structure.

A simple counter example:
H
C
E
Challenges of Bayesian Learning
of Causal Networks

Major challenges






Large search spaces
Hidden variables
Feedback
Assessing parameter and structure priors
Modeling complicated distributions
The remainder of this talk will summarize several
methods for dealing with hidden variables, which is
arguably the biggest major challenge today

These examples provide only a small sample of previous
research
Learning Belief Networks in the Presence of
Missing Values and Hidden Variables
(N. Friedman, ICML, 1997)




Assumes a fixed set of measured and hidden variables
Uses Expectation Maximization (EM) to “fill in” the
values of the hidden variable
Uses BIC to score causal network structures with the
filled-in data. Greedily finds best structure and then
returns to the EM step using this new structure.
Some subsequent work


Use patterns of induced relationships among the measured
variables to suggest where to introduce hidden variables
(Elidan, et al., NIPS, 2000)
Determining the cardinality of the hidden variables
introduced (Elidan &amp; Friedman, UAI, 2001)
A Non-Parametric Bayesian Methods for
Inferring Hidden Causes
(Wood, et al., UAI, 2006)

Learns hidden causes of measured variables
hidden variables
measured variables




Assumes binary variables and noisy-OR interactions
Uses MCMC to sample the hidden structures
Allows in principle an infinite number of hidden
variables
In practice, the number of optimal hidden variables is
constrained by the measured data
Bayesian Learning of Measurement and
Structural Model
(Silva &amp; Scheines, ICML, 2006)

Learns the following type of models
hidden variables
measured variables

Assumes continuous variables, mixture of
Gaussian distributions, and linear interactions
Mixed Ancestral Graphs*


A MAG(G) is a graphical object that contains only the observed
variables, causal arcs, and a new relationship  for representing
hidden confounding.
There exist methods for scoring linear MAGS (Richardson &amp;
Spirtes Ancestral Graph Markov Models, Annals of Statistics,
2002)
SES
SES
SEX
PE
L1
CP
SEX
PE
L2
IQ
Latent Variable DAG
IQ
Corresponding MAG
* This slide was adapted from a slide provided by Peter Spirtes.
CP
A Theoretical Study of Y Structures
for Causal Discovery
(Mani, Spirtes, Cooper, UAI, 2006)


Learn a Bayesian network structure on the measured
variables
Identify patterns in the structure that suggest causal
relationships
A
B
C
D
E
F

The “Y” structure shown in green supports that D is an
unconfounded cause of F.
Causal Discovery
Using Subsets of Variables

Search for an estimate M of the Markov blanket
of a variable X (e.g., Aliferis, et al., AMIA, 2002)


X is independent of other variables in the generating
causal network model, conditioned on the variables
in X’s Markov blanket
Within M search for patterns among the
variables that suggest a causal relationship to X
(e.g., Mani, doctoral dissertation, Un. of
Pittsburgh, 2006)
Causal Identifiability

Generally depends upon
Markov Condition
 Faithfulness Condition
 Informative structural relationships among the
measured variables


Example of the “Y structure”:
A
B
C
E
Evaluation of Causal Discovery


In evaluating a classifier, the correct answer in any
instance is just the value of some variable of interest,
which typically is explicitly in the data set. This make
evaluation relatively straightforward.
In evaluating the output of a causal discovery
algorithm, the answer is not in the dataset. In general
we need some outside knowledge to confirm that the
causal output is correct. This makes evaluation
relatively difficult. Thus, causal discovery algorithms
have not been thoroughly evaluated.
Methods for Evaluating Causal
Discovery Algorithms




Simulated data
Real data with expert judgments of causation
Real data with previously validated causal
relationships
Real data with follow up experiments
An Example of an Evaluation Using
Simulated Data
(Mani, poster here)



Generated 20,000 observational data samples from each
of five CBNs that were manually constructed
Applied the BLCD algorithm, which considers many 4variable subsets of all the variables and applies Bayesian
scoring. It is based on the causal properties of “Y”
structures.
Results


Precision: 83%
Recall: 27%
An Example of an Evaluation Using
Previously Validated Causal Relationships
(Yoo, et al., PSB, 2002)





ILVS is a Bayesian method that considers pairwise relationships
among a set of variables
It works best when given both observational and experimental
data
ILVS was applied to a previously collected DNA microarray
dataset on 9 genes that control galactose metabolism in yeast
(Ideker, et al., Science, 2001) The causal relationships among the
genes have been extensively studied and reported in the
literature.
ILVS predicted 12 of 27 known causal relationships among the
genes (44% recall) and of those 12 eight were correct (67%
precision)
Yoo has explored numerous extensions to ILVS
An Example of an Evaluation Using Real
(Sachs, et al., Science, 2005)




Experimentally manipulated human immune system
cells
Used flow cytometry to measure the effects on 11
proteins and phospholipids on a large number of
individual cells
Used a Bayesian method for causally learning from
observational and experimental data
Derived 17 causal relationships with high probability



15 highly supported by the literature (precision = 15/17 =
88%)
The other two were confirmed experimentally by the
authors (precision = 17/17 = 100%)
Three causal relationships were missed (“recall” = 17 /20 =
85%)
A Possible Approach to Combining
Causal Discovery and Feature Selection
1.
2.
3.
4.
5.
6.
Use prior knowledge and statistical associations to develop
overlapping groups of features (variables)
Derive causal probabilistic relationships within groups
Have the causal groups constrain each other
Determine additional groups of features that might constrain
causal relationships further
Either go to step 2 or step 6
Model average within and across groups to derive approximate
model-averaged causal relationships
David Danks 2002. Learning the Causal Structure of Overlapping Variable
Sets. In S. Lange, K. Satoh, &amp; C.H. Smith, eds. Discovery Science: Proceedings of
the 5th International Conference. Berlin: Springer-Verlag. pp. 178-191.
Some Suggestions
for Further Information

Books





Conferences


Glymour, Cooper (eds), Computation, Causation, and Discovery
(MIT Press, 1999)
Pearl, Causality: Models, Reasoning, and Inference (Cambridge
University Press, 2000)
Spirtes, Glymour, Scheines, Causation, Prediction, and Search
(MIT Press, 2001)
Neapolitan, Learning Bayesian Networks (Prentice Hall, 2003)
UAI, ICML, NIPS, AAAI, IJCAI
Journals

JMLR, Machine Learning
Acknowledgement
Thanks to Peter Spirtes for his comments on an
outline of this talk
```