
Identifying Hidden Variables From Context-Specific Independencies
M.J. Sanscartier and E. Neufeld
Department of Computer Science, University of Saskatchewan
57 Campus Drive, Saskatoon, Saskatchewan, Canada S7K 5A9
Abstract

Learning a Bayesian network from data is a model-specific task, and thus requires careful consideration of contextual information, namely, contextual independencies. In this paper, we study the role of hidden variables in learning causal models from data. We show how statistical methods can help us discover these hidden variables. We suggest hidden variables are wrongly ignored in inference because they are context-specific. We show that contextual consideration can help us learn more about true causal relationships hidden in the data. We present a method for correcting models by finding hidden contextual variables, as well as a means for refining the current, incomplete model.

Introduction

Contextual information must be carefully considered (Sanscartier & Neufeld 2006) to successfully learn a Bayesian network (BN) from the available data. Context has been studied formally in uncertain reasoning with BNs. Pearl's (Pearl 1988) work has made the storing of a distribution for a large data set unnecessary, since the conditional independencies (CIs) in the BN allow the distribution to be stored more compactly. The idea of CI has been generalized (Boutilier et al. 1996) to context-specific independence (CSI) to achieve even smaller distributions and to further improve inference (Zhang & Poole 1999).

A BN is built from known CIs. However, for the more general CSI, the independencies must be discovered from the available data. Since CSIs are not a priori available, they can be treated as hidden variables in the data. In previous research, two discovery methods have been proposed to discover CSI (Zhang & Poole 1999; Butz & Sanscartier 2002). In this paper, we show that by considering the semantics underlying CSI, we can uncover hidden variables, thus leading to a fuller, more accurate account of the given model. Largely, research in Artificial Intelligence (AI) related to context has not addressed semantics or cognitive significance. In this paper, we use CSI and CSI discovery methods to study the meaning of CIs, rather than the well-known algorithmic and inferential benefits generalized CIs provide (Zhang & Poole 1999; Butz & Sanscartier 2002). We show how the original, perhaps erroneous model can be refined, and thus made more accurate, based on the discovery of the hidden contextual information.

The remainder of this paper is organized as follows: First, we present some background information. We discuss Bayesian networks and causal models as representation and inference tools. Then we give an example of a distribution containing hidden variables. Finally, we discuss a method for discovering hidden variables from data by means of contextual independencies. In our conclusions, we outline some potential future directions.
Background Information
Humans base their decisions on their understanding of cause and effect in the world, so decision
models, such as Bayesian networks (BNs), are appropriate for representing causality. In this section, we discuss BNs and how they are a useful
representational tool for causal relations. We then
discuss causal models and present an example of
such a model. We will revisit the example when
we learn hidden variables from data, which will
improve our current model.
Bayesian Networks and Causality
A Bayesian network (BN) (Pearl 1988) is a directed acyclic graph with a conditional probability
distribution associated with each node. The topology of the graph encodes the information that the
joint distribution of all variables in the graph is
equal to the product of the local distributions. BNs
compactly represent joint probability distributions,
and reason efficiently with those representations.
There is a significant literature on inference; (Pearl
1988) is a good place to start.
BN practitioners noticed early on that typical independence assumptions (unconditional independence of diseases, conditional independence of
symptoms) in the diagnosis domain, for example,
tended to orient arcs in the direction of causality.
Pearl and Verma (Pearl & Verma 1991) provided
probabilistic definitions of causality that explained
this phenomenon, but also provided algorithms for
learning cause-effect relationships from raw data.
Causal Models
Several authors express causal models in probabilistic terms because, as argued by Suppes (Suppes 1970), most causal statements in everyday conversation are a reflection of probabilistic and not
categorical relations. For that reason, probability theory should provide an adequate framework
for reasoning with causal knowledge (Good 1983;
Reichenbach 1956). Pearl’s causal models provide
the mechanism and structure needed to allow for
a representation of causal knowledge based on the
presence and absence of CIs.
Definition 1 A causal model (Pearl & Verma 1991) of a set of random variables R can be represented by a directed acyclic graph (DAG), where each node corresponds to an element in R and edges denote direct causal relationships between pairs of elements of R.

The direct causal relations in the causal model can be expressed in terms of probabilistic conditional independencies (CIs) (Pearl 1988).

Definition 2 Let R = {A1, A2, . . . , An} denote a finite set of discrete variables, where each variable A ∈ R takes on values from a finite domain VA. We use capital letters, such as A, B, C, for variable names and lowercase letters a, b, c to denote outcomes of those variables. Let X and Y be two disjoint subsets of variables in R and let Z = R − {X ∪ Y}. We say that Y and Z are conditionally independent given X, denoted I(Y, X, Z), if for any x ∈ VX and y ∈ VY, and for all z ∈ VZ,

p(y|x, z) = p(y|x), whenever p(x, z) > 0.

With the causal model alone, we can express portions of the causal knowledge based on the CIs in the model. The conditional probabilities resulting from the CIs defined in the model can be formally expressed for all configurations in the Cartesian product of the domains of the variables for which we are storing conditional probabilities.

Definition 3 Let X and Y be two subsets of variables in R such that p(y) > 0. We define the conditional probability distribution (CPD) of X given Y = y as

p(x|y) = p(x, y)/p(y), or p(x, y) = p(y) · p(x|y),

for all configurations in VX × VY.
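To make Definitions 2 and 3 concrete, the following minimal Python sketch (our illustration, not part of the paper's method; the variables and numbers are invented) stores a joint distribution over binary variables as a dictionary from configurations to probabilities, computes marginals, and tests a CI by comparing p(y|x, z) with p(y|x):

from itertools import product

def marginal(joint, names, keep):
    # Sum the joint distribution down to the variables listed in `keep`.
    idx = [names.index(v) for v in keep]
    out = {}
    for config, p in joint.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def is_ci(joint, names, Y, X, Z, tol=1e-9):
    # Test I(Y, X, Z): p(y|x,z) = p(y|x) whenever p(x,z) > 0 (Definition 2).
    # Only configurations that occur in the joint are checked.
    pXZ, pX = marginal(joint, names, X + Z), marginal(joint, names, X)
    pYXZ, pYX = marginal(joint, names, Y + X + Z), marginal(joint, names, Y + X)
    for yxz, p_yxz in pYXZ.items():
        y, x, z = yxz[:len(Y)], yxz[len(Y):len(Y) + len(X)], yxz[len(Y) + len(X):]
        if pXZ.get(x + z, 0.0) > 0:
            lhs = p_yxz / pXZ[x + z]           # p(y|x,z)
            rhs = pYX.get(y + x, 0.0) / pX[x]  # p(y|x)
            if abs(lhs - rhs) > tol:
                return False
    return True

# A joint built as p(x) p(y|x) p(z|x), so Y and Z are independent given X.
names = ("X", "Y", "Z")
pX1, pY1, pZ1 = {0: 0.5, 1: 0.5}, {0: 0.2, 1: 0.7}, {0: 0.4, 1: 0.9}
joint = {(x, y, z): pX1[x]
         * (pY1[x] if y == 1 else 1 - pY1[x])
         * (pZ1[x] if z == 1 else 1 - pZ1[x])
         for x, y, z in product((0, 1), repeat=3)}
print(is_ci(joint, names, ("Y",), ("X",), ("Z",)))  # True:  I(Y, X, Z) holds
print(is_ci(joint, names, ("Y",), (), ("Z",)))      # False: Y and Z are marginally dependent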
Example of a Causal Model

Businesses gather information to better understand the needs and preferences of customers. That knowledge helps businesses make decisions such as expanding product lines, targeting a particular age group in promotions, etc. With the possibility of quickly transmitting masses of information over the web, companies can easily and inexpensively gather useful data when customers view a company's website.

For example, consider the fictional airline company Fly4U, an international carrier specializing in leisure travel. Fly4U investigates user behaviour by means of causal models. Each month (since travel varies widely by season), a causal model is created for the potential popularity of a specific destination. The company obtains user information through its website containing travel information for all available destinations. Upon every visit to the website, Fly4U collects a profile of the user, acquired in the form of a short survey.

The causal model in Figure 1 describes causal relationships between five variables directly related to the choice of a destination in a given month.
The five variables are the following: (A)ge group of user, (C)ost of hotels user is considering, known holidays at place of (O)rigin of user, whether user (H)as children, and finally, (D)estination of travels. Fly4U uses the information to establish ticket rates to maximize sales and offer promotional rebates for relevant travel groups on any given month.

Figure 1: Example Causal Model.
From the user profiles, Fly4U wants to better understand traveling trends in February. For ease of illustration, we assume the domain of every variable to be binary. To make this possible for the (D)estination variable, we replace D by binary variables D1, . . . , Dn, where Di represents one particular destination. We further separate the data into n causal models, one for each destination in a given month. In a given causal model for month X and destination Di, the variable Di takes on value 0 for no travel to destination Di, and 1 for travel to destination Di. The causal model in Figure 2 illustrates this slight modification, where we investigate travel to destination Di for the month of February. Instantiations for the five variables are presented in Table I.
Variable:   A      C            O    H    Di
Value 0:    < 25   Inexpensive  No   No   No
Value 1:    ≥ 25   Expensive    Yes  Yes  Yes

Table I: Instantiations for A, C, O, H, and D.

Figure 2: Causal model for February.
Discovery of Hidden Variables
Since BNs operate on the general notion of CIs, it is difficult to consider hidden variables in the data or even to be aware of their presence. In this section, we first instantiate a CPD from our running example, which is based solely on CI. We then introduce CSI and how it allows us to consider contexts and therefore have a starting point for finding hidden variables. Once again, we illustrate this with our running example. In the subsequent subsection, we show how the discovery of CSIs in the data helps refine and correct our existing model.

Instantiation of a CPD in Fly4U Example

For simplicity, we model the month of February only. According to the DAG, there is a direct causal relationship between variables A and C. There is also a direct causal influence from A to the origin (O). The last causal relationship emerging from A is directed to the destination (D). Whether the user has children (H) is causally related to relevant holidays at the point of origin (O). In turn, variable O is directly related to D, the destination. Finally, hotel cost (C) and destination (D) are directly related. The CPDs corresponding to variables A, H, O, C, and D respectively are the following: p(A), p(H), p(O|A, H), p(C|A), p(D|A, C, O).
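As an illustration of how this factorization might be coded, the sketch below (ours; the numbers are hypothetical placeholders, not Fly4U data) assembles the joint distribution as the product of the five CPDs and checks that it sums to one:

from itertools import product

# Hypothetical CPD parameters for the binary model; p_O, p_C and p_D store
# the probability that the child variable equals 1 given its parents.
p_A = {0: 0.6, 1: 0.4}                                          # p(A = a)
p_H = {0: 0.7, 1: 0.3}                                          # p(H = h)
p_O = {(a, h): 0.2 + 0.5 * h for a in (0, 1) for h in (0, 1)}   # p(O=1 | a, h)
p_C = {0: 0.3, 1: 0.8}                                          # p(C=1 | a)
p_D = {(a, c, o): 0.1 + 0.2 * a + 0.3 * o                       # p(D=1 | a, c, o)
       for a in (0, 1) for c in (0, 1) for o in (0, 1)}

def bern(p1, v):
    # Probability of a binary outcome v, given p(v = 1) = p1.
    return p1 if v == 1 else 1.0 - p1

def joint(a, h, o, c, d):
    # p(a, h, o, c, d) = p(a) p(h) p(o|a,h) p(c|a) p(d|a,c,o)
    return (p_A[a] * p_H[h] * bern(p_O[(a, h)], o)
            * bern(p_C[a], c) * bern(p_D[(a, c, o)], d))

# The factored joint is a proper distribution: it sums to 1.
total = sum(joint(*cfg) for cfg in product((0, 1), repeat=5))
assert abs(total - 1.0) < 1e-9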
The causal model for February in Figure 1 seems reasonable and intuitive. However, discovery of hidden variables may yield some interesting information about subsets of the variables that may give useful indicators as to which user group to target when offering promotional material. Pearl and Verma (Pearl & Verma 1991) make that possibility clear in their treatment of exceptions: “causal expressions often tolerate exceptions, primarily due to missing variables and coarse descriptions” (Pearl & Verma 1991).

From the causal model in Figure 2, the variables directly related to Di are Cost of Hotel (C), Age Group (A), and Major Relevant Holiday at Origin (O). The CPD associated with these variables is presented in Table II.
A  C  O  Di   p(Di|A, C, O)
0  0  0  0    0.80
0  0  0  1    0.20
0  0  1  0    0.10
0  0  1  1    0.90
0  1  0  0    0.80
0  1  0  1    0.20
0  1  1  0    0.10
0  1  1  1    0.90
1  0  0  0    0.15
1  0  0  1    0.85
1  0  1  0    0.15
1  0  1  1    0.85
1  1  0  0    0.05
1  1  0  1    0.95
1  1  1  0    0.05
1  1  1  1    0.95

Table II: The CPD p(Di|A, C, O).

Table II shows that some users have strong interest in traveling to Di, while others show no interest at all. There is no clear indication that February is particularly popular for travels to Di for a group of users; if it were the case for the majority of users, most probability values in the table would be quite high. From the available information, we cannot determine if a particular subset of users is more likely to travel to Di in February. We know that some interest is shown, but no information indicates whether the interest is random or particular to a discrete subset. Finally, it is impossible to remove any of the variables in the table without altering the probability values of other configurations. All values would have to be marginalized and re-normalized and would therefore be different. Below, we see how a discovery method for hidden variables reveals strong influences hidden in this seemingly inconclusive CPD.

Context-Specific Independence (CSI)

Boutilier et al. (Boutilier et al. 1996) formalized the notion of context-specific independence. Without CSI, it is only possible to establish a causal relationship between two variables if a certain set of CIs is absent for all values of a variable in the distribution. With CSI, we can recognize CIs that hold for a subset of values of a variable in a distribution. Discovery of CSI can help us build more specific causal models instead of a single causal model ignoring particular subsets of values. Boutilier et al. define CSI as follows:

Definition 4 Let X, Y, Z, C be pairwise disjoint subsets of variables in R, and let c ∈ VC. We say that Y and Z are conditionally independent given X in context C = c, denoted IC=c(Y, X, Z), if

p(y|x, z, c) = p(y|x, c), whenever p(x, z, c) > 0.

Since we are dealing with partial CPDs, a more general operator is necessary for manipulating CPDs containing CSIs. This operator (Zhang & Poole 1999) is called the union-product.
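Definition 4 can be tested directly on a tabular CPD such as Table II: fix the context A = a and ask whether the table's entries change with C. A small sketch (ours, not the authors' code), with the CPD transcribed as a dictionary:

# Table II as a dictionary: (a, c, o, di) -> p(di | a, c, o).
cpd = {
    (0,0,0,0): 0.80, (0,0,0,1): 0.20, (0,0,1,0): 0.10, (0,0,1,1): 0.90,
    (0,1,0,0): 0.80, (0,1,0,1): 0.20, (0,1,1,0): 0.10, (0,1,1,1): 0.90,
    (1,0,0,0): 0.15, (1,0,0,1): 0.85, (1,0,1,0): 0.15, (1,0,1,1): 0.85,
    (1,1,0,0): 0.05, (1,1,0,1): 0.95, (1,1,1,0): 0.05, (1,1,1,1): 0.95,
}

def c_irrelevant_in_context(cpd, a):
    # IC=a(Di, O, C): p(di | a, c, o) is the same for c = 0 and c = 1.
    return all(cpd[(a, 0, o, di)] == cpd[(a, 1, o, di)]
               for o in (0, 1) for di in (0, 1))

print(c_irrelevant_in_context(cpd, 0))  # True:  C drops out in context A = 0
print(c_irrelevant_in_context(cpd, 1))  # False: C still matters in context A = 1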
CSI Discovery

To attempt to target specific groups within the
available information, we need to consider context.
If we can conclude that there exists a particularity
about a context of a variable, we may be able to
detect a group responsible for the partial popularity of travels to Di during the month of February.
In this subsection, we see that a consideration of
context may actually change the original model in
Figure 2. We use a CSI detection method called
Refine-CPD-tree (Butz & Sanscartier 2002). The
method is based on a tree representation of a CPD.
Using this algorithm, we can see if a tree reduction
is possible. If such a reduced tree exists, the data
contains a CSI, which is an indication of a hidden
variable that could perhaps correct a faulty model
that may otherwise appear to be correct. The detection method works as follows: Given a tree representation of a CPD, if all children of a node A are
identical, then replace A by one of its offspring,
and delete all other children of A.
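The reduction rule just stated is easy to realize on a simple tree structure. The sketch below (ours, not the implementation of Butz & Sanscartier) compares subtrees by a recursive signature and collapses any node whose children are all identical:

class Node:
    # An interior node tests `var` and branches on its values;
    # a leaf stores the probability vector over Di.
    def __init__(self, var=None, children=None, leaf=None):
        self.var, self.children, self.leaf = var, children, leaf

    def key(self):
        # Hashable signature used to decide whether two subtrees are identical.
        if self.leaf is not None:
            return ("leaf", tuple(self.leaf))
        return (self.var, tuple((v, c.key()) for v, c in sorted(self.children.items())))

def refine(node):
    # Bottom-up: if all children of a node are identical, replace the node
    # by one of its offspring and delete the other children.
    if node.leaf is not None:
        return node
    node.children = {v: refine(c) for v, c in node.children.items()}
    kids = list(node.children.values())
    if all(k.key() == kids[0].key() for k in kids[1:]):
        return kids[0]  # the tested variable is irrelevant on this branch
    return node

# The A = 0 branch of Figure 3 (values from Table II): the C node has two
# identical O subtrees, so refining collapses it, as in Figure 4.
def o_subtree():
    return Node(var="O", children={0: Node(leaf=[0.80, 0.20]),
                                   1: Node(leaf=[0.10, 0.90])})

branch_a0 = Node(var="C", children={0: o_subtree(), 1: o_subtree()})
print(refine(branch_a0).var)  # prints: O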
In our running example, we have a CPD, which
contains all available information relevant to travels to Di in February, as depicted in Table II. Recall that no variables can be removed from that
distribution based on CI, since the independence
would have to hold for all values in the distribution. The Refine-CPD algorithm can determine if
context-specific independencies reside in the data.
The CPD in Table II can be represented as the
CPD-tree in Figure 3.
Running the Refine-CPD algorithm yields the
refined CPD-tree in Figure 4. The variable C no
longer appears on the LHS of the tree, in the context A = 0. In addition, on the RHS of the tree, in
context A = 1, the variable O no longer appears.
This suggests a hidden relationship in variable A
in context A = 0 and in context A = 1.
Figure 3: Initial CPD-tree for p(Di|A, C, O).

Figure 4: Refined CPD-tree for p(Di|A, C, O).
Uncovering Hidden Variables
The previous subsection showed that a CSI discovery algorithm can uncover hidden relationships in
a CPD when no causal independencies can be inferred by considering the entire dataset. The example showed that some user group in A (A < 25)
may help explain the partial popularity of travels
to Di in February. If we look again at Table II,
and consider A = 0 and A = 1 separately, removing C from the distribution in configurations
where A = 0 doesn’t change the likelihood of
occurrence of Di , whereas such a removal would
be impossible in the context A = 1. In A = 0,
p(Di |A = 0, O, C) = 0.80 when O = 0 and
Di = 0, 0.20 when O = 0 and Di = 1, 0.90
when O = 1 and Di = 0, and finally, 0.10 when
O = 1 and Di = 1. In context A = 1, saying
p(Di |A = 1, O, C) = 0.15 when O = 0 and
Di = 0, is not completely correct since it is also
true that in context A = 1, p(Di |A = 1, O, C) =
0.05 when O = 0 and Di = 0. In the first case of
context A = 1, C = 0, while in the second case,
C = 1. Therefore, the value of C does change the
probability of travels to Di in context A = 1, so
no removal is possible. We conclude that in context A = 0, variables Di and C are independent
given variable O. Such a separation is legal since
no information is lost due to the union-product operator discussed in the subsection on context-specific independence. From the resulting CPDs, we may now make more adequate judgments about the user. The CPD after refinement is presented in Table III. When variable A takes on value 0 (A < 25), travel to Di in February is extremely popular and hotel cost doesn't seem to be an issue. This puzzling conclusion leads Fly4U to investigate further. Since, also in that context, there is a strong dependency between the age group (A) and a relevant holiday at the origin (O), that factor is also taken into consideration for the investigation. This discovery allows Fly4U to enter a smaller reference class for their investigation, which makes the problem easier to resolve. A large number of people aged under 25 are university students. In February, universities have a “Reading Week” break. During Reading Week, many travel on organized trips. Due to group discounts,
hotel costs are not a large issue despite the travellers’ young age. On the other hand, users aged
over 25 are likely no longer students and are probably working. Most places of employment do not
have a February break and therefore, variable O,
holiday at origin, does not serve that portion of
the population. Finally, if people from that group
(over 25) were traveling in February regardless,
they would be more likely to pay attention to hotel
costs.
(i) The full CPD p(Di|A, C, O), with configurations written as ACODi:
0000: 0.80   0001: 0.20   0010: 0.10   0011: 0.90
0100: 0.80   0101: 0.20   0110: 0.10   0111: 0.90
1000: 0.15   1001: 0.85   1010: 0.15   1011: 0.85
1100: 0.05   1101: 0.95   1110: 0.05   1111: 0.95

(ii) The CPD split by context: p(Di|A = 0, C, O) over configurations 0000–0111 and p(Di|A = 1, C, O) over configurations 1000–1111, with the same values as in (i).

(iii) The reduced CPDs: p(Di|A = 0, O) over configurations AODi: 000: 0.80, 001: 0.20, 010: 0.90, 011: 0.10; and p(Di|A = 1, C) over configurations ACDi: 100: 0.15, 101: 0.85, 110: 0.05, 111: 0.95.

Table III: Representation of CSIs.
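The projection described above — splitting the CPD of Table II by the context variable A and dropping the conditioning variable that has become irrelevant — can be sketched as follows (our code; the cpd dictionary is Table II transcribed again so the snippet is self-contained):

cpd = {  # (a, c, o, di) -> p(di | a, c, o), as in Table II.
    (0,0,0,0): 0.80, (0,0,0,1): 0.20, (0,0,1,0): 0.10, (0,0,1,1): 0.90,
    (0,1,0,0): 0.80, (0,1,0,1): 0.20, (0,1,1,0): 0.10, (0,1,1,1): 0.90,
    (1,0,0,0): 0.15, (1,0,0,1): 0.85, (1,0,1,0): 0.15, (1,0,1,1): 0.85,
    (1,1,0,0): 0.05, (1,1,0,1): 0.95, (1,1,1,0): 0.05, (1,1,1,1): 0.95,
}

def project(cpd, a, drop):
    # In context A = a, remove the variable `drop` ("C" or "O") provided the
    # CPD entries do not depend on it; otherwise return None.
    out = {}
    for (ca, c, o, di), p in cpd.items():
        if ca != a:
            continue
        key = (a, o, di) if drop == "C" else (a, c, di)
        if key in out and out[key] != p:
            return None  # the dropped variable still matters in this context
        out[key] = p
    return out

print(project(cpd, 0, "C"))  # the 4-entry reduced CPD p(Di | A=0, O)
print(project(cpd, 1, "O"))  # the 4-entry reduced CPD p(Di | A=1, C)
print(project(cpd, 1, "C"))  # None: C cannot be removed in context A = 1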
Correcting the Model
Since there is no longer mention of variable C in context A = 0, we can refine our causal model by removing the direct causal link between C and Di, and similarly in context A = 1 for variable O. With the uncovered hidden contexts of variable A, when considering the probability of travel to destination Di given all factors that have a direct causal link with Di, the initial causal model in Figure 2 can be represented by two more specific causal models that account for the particularities of subsets of the data, depending on age group. Those refined causal models are illustrated in Figure 5, where the LHS represents the refined model for context A = 0 (under 25), and the RHS represents the refined model for context A = 1 (over 25).
Figure 5: Causal Models After Discovery.
From the discovery of CSI and an interpretation of the results permitted by the smaller reference class, we see that the causal models are substantially different when age groups are considered separately. From the study, Fly4U could benefit by offering group discounts to students for February, since they now know that a large portion of the student population will be traveling. No airline company with that kind of insight wants to lose an entire target group of customers to another airline! The techniques presented in this example can generalize to the wider class of uncertain reasoning.

Conclusions and Future Work

Studies frequently reveal that learners learn differently in different contexts or that marketing strategies perform differently in different contexts. For example, continuing with the theme of travel websites, it seems reasonable to expect that different groups will respond differently to presentation styles, and website designers will want to identify membership in a subgroup before dynamically choosing such a strategy. Given sufficient traffic, it may be possible to build very detailed models and to discover many context-specific strategies for maximizing sales in target groups. The example here shows that even with smaller models, it is possible to discover such contexts, and that such contexts lead us to very different conclusions.

References

Boutilier, C.; Friedman, N.; Goldszmidt, M.; and Koller, D. 1996. Context-specific independence in Bayesian networks. In Proc. of the 12th UAI, 115–123.

Butz, C., and Sanscartier, M. 2002. A method for detecting context-specific independence in conditional probability tables. In Proc. of the 3rd Int'l RSCTC, 344–348.

Good, I. 1983. A causal calculus. British Journal for the Philosophy of Science 11.

Pearl, J., and Verma, T. 1991. A theory of inferred causation. In Principles of Knowledge Representation and Reasoning, 441–452. Morgan Kaufmann.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, USA: Morgan Kaufmann.

Reichenbach, H. 1956. The Direction of Time. Berkeley: University of California Press.

Sanscartier, M., and Neufeld, E. 2006. Discovering hidden dispositions and situational factors in causal relations by means of contextual independencies. In Proc. of the 28th Conference of the Cognitive Science Society.

Suppes, P. 1970. A Probabilistic Theory of Causality. Amsterdam: North-Holland.

Zhang, N., and Poole, D. 1999. On the role of context-specific independence in probabilistic reasoning. In Proc. of the 16th IJCAI, 1288–1293.