Identifying Hidden Variables From Context-Specific Independencies

M.J. Sanscartier and E. Neufeld
Department of Computer Science, University of Saskatchewan
57 Campus Drive, Saskatoon, Saskatchewan, Canada S7K 5A9

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Learning a Bayesian network from data is a model-specific task, and thus requires careful consideration of contextual information, namely, contextual independencies. In this paper, we study the role of hidden variables in learning causal models from data. We show how statistical methods can help us discover these hidden variables. We suggest hidden variables are wrongly ignored in inference because they are context-specific. We show that contextual consideration can help us learn more about true causal relationships hidden in the data. We present a method for correcting models by finding hidden contextual variables, as well as a means for refining the current, incomplete model.

Introduction

Contextual information must be carefully considered (Sanscartier & Neufeld 2006) to successfully learn a Bayesian network (BN) from the available data. Context has been studied formally in uncertain reasoning with BNs. Pearl's work (Pearl 1988) made storing a full distribution for a large data set unnecessary, since the conditional independencies (CIs) in the BN allow the distribution to be stored more compactly. The idea of CI has been generalized (Boutilier et al. 1996) to context-specific independence (CSI) to achieve even smaller distributions and to further improve inference (Zhang & Poole 1999).

A BN is built from known CIs. However, for the more general CSI, the independencies must be discovered from the available data. Since CSIs are not a priori available, they can be treated as hidden variables in the data. In previous research, two discovery methods have been proposed to discover CSI (Zhang & Poole 1999; Butz & Sanscartier 2002). In this paper, we show that by considering the semantics underlying CSI, we can uncover hidden variables, thus leading to a fuller, more accurate account of the given model.

Largely, research in Artificial Intelligence (AI) related to context has not addressed semantics or cognitive significance. In this paper, we use CSI and CSI discovery methods to study the meaning of CIs, rather than the well-known algorithmic and inferential benefits generalized CIs provide (Zhang & Poole 1999; Butz & Sanscartier 2002). We show how the original, perhaps erroneous, model can be refined, and thus made more accurate, based on the discovery of hidden contextual information.

The remainder of this paper is organized as follows: First, we present some background information. We discuss Bayesian networks and causal models as representation and inference tools. Then we give an example of a distribution containing hidden variables. Finally, we discuss a method for discovering hidden variables from data by means of contextual independencies. In our conclusions, we outline some potential future directions.

Background Information

Humans base their decisions on their understanding of cause and effect in the world, so decision models, such as Bayesian networks (BNs), are appropriate for representing causality. In this section, we discuss BNs and how they are a useful representational tool for causal relations. We then discuss causal models and present an example of such a model. We will revisit the example when we learn hidden variables from data, which will improve our current model.

Bayesian Networks and Causality

A Bayesian network (BN) (Pearl 1988) is a directed acyclic graph with a conditional probability distribution associated with each node. The topology of the graph encodes the information that the joint distribution of all variables in the graph is equal to the product of the local distributions. BNs compactly represent joint probability distributions, and reason efficiently with those representations. There is a significant literature on inference; (Pearl 1988) is a good place to start.

BN practitioners noticed early on that typical independence assumptions (unconditional independence of diseases, conditional independence of symptoms) in the diagnosis domain, for example, tended to orient arcs in the direction of causality. Pearl and Verma (Pearl & Verma 1991) provided probabilistic definitions of causality that explained this phenomenon, but also provided algorithms for learning cause-effect relationships from raw data.

Causal Models

Several authors express causal models in probabilistic terms because, as argued by Suppes (Suppes 1970), most causal statements in everyday conversation are a reflection of probabilistic and not categorical relations. For that reason, probability theory should provide an adequate framework for reasoning with causal knowledge (Good 1983; Reichenbach 1956). Pearl's causal models provide the mechanism and structure needed to allow for a representation of causal knowledge based on the presence and absence of CIs.

Definition 1 A causal model (Pearl & Verma 1991) of a set of random variables R can be represented by a directed acyclic graph (DAG), where each node corresponds to an element in R and edges denote direct causal relationships between pairs of elements of R.

The direct causal relations in the causal model can be expressed in terms of probabilistic conditional independencies (CIs) (Pearl 1988).

Definition 2 Let R = {A1, A2, ..., An} denote a finite set of discrete variables, where each variable A ∈ R takes on values from a finite domain VA. We use capital letters, such as A, B, C, for variable names and lowercase letters a, b, c to denote outcomes of those variables. Let X and Y be two disjoint subsets of variables in R and let Z = R − {X ∪ Y}. We say that Y and Z are conditionally independent given X, denoted I(Y, X, Z), if, given any x ∈ VX, y ∈ VY, then for all z ∈ VZ,

    p(y|x, z) = p(y|x), when p(x, z) > 0.
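To make Definition 2 concrete, the following sketch (our illustration, not part of the original paper) checks I(Y, X, Z) numerically on a small tabulated joint distribution; the numbers are hypothetical, constructed so that the independence holds.

```python
from itertools import product

# Hypothetical joint p(x, y, z) over binary variables, built so that
# Y and Z are conditionally independent given X.
p_x = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_z_given_x = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}
joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_x[x][z]
         for x, y, z in product((0, 1), repeat=3)}

def holds_ci(joint):
    """Check I(Y, X, Z): p(y|x,z) = p(y|x) whenever p(x,z) > 0 (Definition 2)."""
    for (x, y, z), p_xyz in joint.items():
        p_xz = sum(v for (x2, _, z2), v in joint.items() if (x2, z2) == (x, z))
        p_xy = sum(v for (x2, y2, _), v in joint.items() if (x2, y2) == (x, y))
        px = sum(v for (x2, _, _), v in joint.items() if x2 == x)
        if p_xz > 0 and abs(p_xyz / p_xz - p_xy / px) > 1e-9:
            return False
    return True

print(holds_ci(joint))  # True
```

The same kind of brute-force test, applied later to the Fly4U CPD, fails for every candidate variable until context is taken into account.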
With the causal model alone, we can express portions of the causal knowledge based on the CIs in the model. The conditional probabilities resulting from the CIs defined in the model can be formally expressed for all configurations in the Cartesian product of the domains of the variables for which we are storing conditional probabilities.

Definition 3 Let X and Y be two subsets of variables in R such that p(y) > 0. We define the conditional probability distribution (CPD) of X given Y = y as

    p(x|y) = p(x, y) / p(y), or equivalently p(x, y) = p(y) · p(x|y),

for all configurations in VX × VY.
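Definition 3 translates directly into code. As a minimal illustration (ours, over a hypothetical joint distribution of two binary variables):

```python
def cpd(joint, y):
    """p(x | Y=y) = p(x, y) / p(y), for p(y) > 0 (Definition 3)."""
    p_y = sum(v for (_, y2), v in joint.items() if y2 == y)
    if p_y == 0:
        raise ValueError("conditioning event has zero probability")
    return {x: v / p_y for (x, y2), v in joint.items() if y2 == y}

# Hypothetical joint p(X, Y):
joint_xy = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.40}
print(cpd(joint_xy, y=1))  # {0: 0.2, 1: 0.8}
```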
Example of a Causal Model

Businesses gather information to better understand the needs and preferences of customers. That knowledge helps businesses make decisions such as expanding product lines, targeting a particular age group in promotions, and so on. With the ability to quickly transmit masses of information over the web, companies can easily and inexpensively gather useful data when customers view a company's website.

For example, consider the fictional airline company Fly4U, an international carrier specializing in leisure travel. Fly4U investigates user behaviour by means of causal models. Each month (since travel varies widely by season), a causal model is created for the potential popularity of a specific destination. The company obtains user information through its website, which contains travel information for all available destinations. Upon every visit to the website, Fly4U collects a profile of the user, acquired in the form of a short survey.

The causal model in Figure 1 describes causal relationships between five variables directly related to the choice of a destination in a given month. The five variables are the following: (A)ge group of user, (C)ost of hotels user is considering, known holidays at place of (O)rigin of user, whether user (H)as children, and finally, (D)estination of travels. Fly4U uses the information to establish ticket rates to maximize sales and offer promotional rebates for relevant travel groups in any given month.

[Figure 1: Example causal model. Nodes A, H, C, O, D, with arcs A → C, A → O, A → D, H → O, O → D, and C → D.]

Instantiation of a CPD in Fly4U Example

For simplicity, we model the month of February only. According to the DAG, there is a direct causal relationship between variable A and C. There is also a direct causal influence from A to the origin (O). The last causal relationship emerging from A is directed to the destination (D). Whether the user has children (H) is causally related to relevant holidays at the point of origin (O). In turn, variable O is directly related to D, the destination. Finally, hotel cost (C) and destination (D) are directly related. The CPDs corresponding to variables A, H, O, C, and D respectively are the following: p(A), p(H), p(O|A, H), p(C|A), p(D|A, C, O).

The causal model for February in Figure 1 seems reasonable and intuitive. However, discovery of hidden variables may yield some interesting information about subsets of the variables, which may give useful indicators as to what user group to target when offering promotional material. Pearl and Verma (Pearl & Verma 1991) make clear that possibility in their treatment of exceptions: "causal expressions often tolerate exceptions, primarily due to missing variables and coarse descriptions."

From the user profiles, Fly4U wants to better understand travelling trends in February. For ease of illustration, we assume the domain of every variable to be binary. To make this possible for the (D)estination variable, we introduce binary variables Di, i = 1, ..., n, where each Di represents one particular destination, and we separate the data into n causal models, one for each destination in a given month. In a given causal model for month X and destination Di, the variable Di takes on value 0 for no travel to destination Di, and 1 for travel to destination Di. The causal model in Figure 2 illustrates this slight modification, where we investigate travel to destination Di for the month of February. Instantiations for the five variables are presented in Table I.

Value   A      C            O    H    Di
0       < 25   Inexpensive  No   No   No
1       ≥ 25   Expensive    Yes  Yes  Yes

Table I: Instantiations for A, C, O, H, and D.

[Figure 2: Causal model for February. The structure of Figure 1, with D replaced by the binary variable Di.]
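The February model's factored form, p(A) p(H) p(O|A, H) p(C|A) p(Di|A, C, O), can be coded directly. The sketch below is our illustration only: the parent sets follow Figure 2, the table p_Di would be filled in from Table II in the next section, and all numeric values shown are hypothetical placeholders.

```python
from itertools import product

# Parent sets of the February model (Figure 2).
parents = {"A": (), "H": (), "O": ("A", "H"), "C": ("A",), "Di": ("A", "C", "O")}

# CPDs keyed by (child value, *parent values); all numbers here are
# hypothetical placeholders, not taken from the paper.
p_A = {(0,): 0.5, (1,): 0.5}
p_H = {(0,): 0.6, (1,): 0.4}
p_O = {(o, a, h): 0.5 for o in (0, 1) for a in (0, 1) for h in (0, 1)}
p_C = {(c, a): 0.5 for c in (0, 1) for a in (0, 1)}
p_Di = {(d, a, c, o): 0.5 for d in (0, 1) for a in (0, 1)
        for c in (0, 1) for o in (0, 1)}  # replace with Table II values

def joint(a, h, o, c, d):
    """p(a, h, o, c, d) = p(a) p(h) p(o|a,h) p(c|a) p(d|a,c,o)."""
    return (p_A[(a,)] * p_H[(h,)] * p_O[(o, a, h)]
            * p_C[(c, a)] * p_Di[(d, a, c, o)])

# Sanity check: the factored distribution sums to 1 over all configurations.
assert abs(sum(joint(*cfg) for cfg in product((0, 1), repeat=5)) - 1.0) < 1e-9
```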
Discovery of Hidden Variables

Since BNs operate on the general notion of CIs, it is difficult to consider hidden variables in the data, or even to be aware of their presence. In this section, we first instantiate a CPD from our running example, which is based solely on CI. We then introduce CSI and show how it allows us to consider contexts, and therefore gives a starting point for finding hidden variables. Once again, we illustrate this with our running example. In the subsequent subsection, we show how the discovery of CSIs in the data helps refine and correct our existing model.

From the causal model in Figure 2, the variables directly related to Di are Cost of Hotel (C), Age Group (A), and Major Relevant Holiday at Origin (O). The CPD associated with these variables is presented in Table II.

A  C  O  Di  p(Di|A, C, O)
0  0  0  0   0.80
0  0  0  1   0.20
0  0  1  0   0.10
0  0  1  1   0.90
0  1  0  0   0.80
0  1  0  1   0.20
0  1  1  0   0.10
0  1  1  1   0.90
1  0  0  0   0.15
1  0  0  1   0.85
1  0  1  0   0.15
1  0  1  1   0.85
1  1  0  0   0.05
1  1  0  1   0.95
1  1  1  0   0.05
1  1  1  1   0.95

Table II: The CPD p(Di|A, C, O).

Table II shows that some users have a strong interest in travelling to Di, while others show no interest at all. There is no clear indication that February is particularly popular for travel to Di for a group of users; if it were the case for the majority of users, most probability values in the table would be quite high. From the available information, we cannot determine if a particular subset of users is more likely to travel to Di in February. We know that some interest is shown, but no information indicates whether the interest is random or particular to a discrete subset. Finally, it is impossible to remove any of the variables in the table without altering the probability values of other configurations. All values would have to be marginalized and re-normalized and would therefore be different. Below, we see how a discovery method for hidden variables reveals strong influences hidden in this seemingly inconclusive CPD.
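The claim that no variable can be removed from Table II can be verified mechanically. This brute-force check is our sketch (not the paper's algorithm): it encodes p(Di = 1|A, C, O) from Table II and tests, for each parent, whether flipping that parent's value ever changes the probability.

```python
from itertools import product

# p(Di = 1 | A, C, O) from Table II, keyed by (a, c, o).
p_travel = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.90, (0, 1, 0): 0.20, (0, 1, 1): 0.90,
    (1, 0, 0): 0.85, (1, 0, 1): 0.85, (1, 1, 0): 0.95, (1, 1, 1): 0.95,
}

def removable(i):
    """True if p(Di | A, C, O) never depends on parent i (0=A, 1=C, 2=O)."""
    for cfg in product((0, 1), repeat=3):
        other = list(cfg)
        other[i] ^= 1  # flip the candidate parent's value
        if p_travel[cfg] != p_travel[tuple(other)]:
            return False
    return True

for name, i in (("A", 0), ("C", 1), ("O", 2)):
    print(name, removable(i))  # A False, C False, O False: no CI holds
```

No parent passes the test over the whole table; yet, as the following subsections show, C passes it within the context A = 0, and O passes it within the context A = 1.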
Context-Specific Independence (CSI)

Boutilier et al. (Boutilier et al. 1996) formalized the notion of context-specific independence. Without CSI, it is only possible to establish a causal relationship between two variables if a certain set of CIs is absent for all values of a variable in the distribution. With CSI, we can recognize CIs that hold for a subset of values of a variable in a distribution. Discovery of CSI can help us build more specific causal models instead of a single causal model that ignores particular subsets of values. Boutilier et al. define CSI as follows:

Definition 4 Let X, Y, Z, C be pairwise disjoint subsets of variables in R, and let c ∈ VC. We say that Y and Z are conditionally independent given X in context C = c, denoted IC=c(Y, X, Z), if

    p(y|x, z, c) = p(y|x, c), when p(x, z, c) > 0.

Since we are dealing with partial CPDs, a more general operator is necessary for manipulating CPDs containing CSIs. This operator (Zhang & Poole 1999) is called the union-product.

CSI Discovery

To attempt to target specific groups within the available information, we need to consider context. If we can conclude that there exists a particularity about a context of a variable, we may be able to detect a group responsible for the partial popularity of travel to Di during the month of February. In this subsection, we see that a consideration of context may actually change the original model in Figure 2.

We use a CSI detection method called Refine-CPD-tree (Butz & Sanscartier 2002). The method is based on a tree representation of a CPD. Using this algorithm, we can see if a tree reduction is possible. If such a reduced tree exists, the data contains a CSI, which is an indication of a hidden variable that could perhaps correct a faulty model that may otherwise appear to be correct. The detection method works as follows: given a tree representation of a CPD, if all children of a node A are identical, then replace A by one of its offspring, and delete all other children of A.

In our running example, we have a CPD containing all available information relevant to travel to Di in February, as depicted in Table II. Recall that no variables can be removed from that distribution based on CI, since the independence would have to hold for all values in the distribution. The Refine-CPD algorithm can determine if context-specific independencies reside in the data. The CPD in Table II can be represented as the CPD-tree in Figure 3. Running the Refine-CPD algorithm yields the refined CPD-tree in Figure 4.

[Figure 3: Initial CPD-tree for p(Di|A, C, O): the root tests A, its children test C, their children test O, and each leaf stores the distribution of Di from Table II.]

[Figure 4: Refined CPD-tree for p(Di|A, C, O): in context A = 0 the tree branches only on O; in context A = 1 it branches only on C.]

The variable C no longer appears on the LHS of the tree, in the context A = 0. In addition, on the RHS of the tree, in context A = 1, the variable O no longer appears. This suggests a hidden relationship in variable A in context A = 0 and in context A = 1.
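The reduction rule, replace a node by one of its children whenever all of its children are identical, is easy to sketch. The code below is our reconstruction of that rule, not the authors' implementation; the input is the CPD-tree of Figure 3, built from Table II.

```python
def leaf(p0, p1):
    """Leaf of the CPD-tree: the distribution of Di."""
    return {0: p0, 1: p1}

def refine(tree):
    """Refine-CPD-tree rule: if all children of a node are identical,
    replace the node by one of its offspring."""
    if isinstance(tree, dict):                # leaf: nothing to refine
        return tree
    var, children = tree
    children = {val: refine(sub) for val, sub in children.items()}
    subtrees = list(children.values())
    if all(sub == subtrees[0] for sub in subtrees[1:]):
        return subtrees[0]                    # var is irrelevant in this context
    return (var, children)

# Initial CPD-tree for p(Di | A, C, O) (Figure 3): root A, then C, then O.
tree = ("A", {
    0: ("C", {0: ("O", {0: leaf(0.80, 0.20), 1: leaf(0.10, 0.90)}),
              1: ("O", {0: leaf(0.80, 0.20), 1: leaf(0.10, 0.90)})}),
    1: ("C", {0: ("O", {0: leaf(0.15, 0.85), 1: leaf(0.15, 0.85)}),
              1: ("O", {0: leaf(0.05, 0.95), 1: leaf(0.05, 0.95)})}),
})

print(refine(tree))
# ('A', {0: ('O', {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}),
#        1: ('C', {0: {0: 0.15, 1: 0.85}, 1: {0: 0.05, 1: 0.95}})})
```

The refined tree drops C in the context A = 0 and drops O in the context A = 1, exactly the two CSIs read off Figure 4.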
Uncovering Hidden Variables

The previous subsection showed that a CSI discovery algorithm can uncover hidden relationships in a CPD when no causal independencies can be inferred by considering the entire dataset. The example showed that some user group in A (A < 25) may help explain the partial popularity of travel to Di in February.

If we look again at Table II and consider A = 0 and A = 1 separately, removing C from the distribution in configurations where A = 0 does not change the likelihood of occurrence of Di, whereas such a removal would be impossible in the context A = 1. In A = 0, p(Di|A = 0, O, C) = 0.80 when O = 0 and Di = 0, 0.20 when O = 0 and Di = 1, 0.10 when O = 1 and Di = 0, and finally, 0.90 when O = 1 and Di = 1. In context A = 1, saying p(Di|A = 1, O, C) = 0.15 when O = 0 and Di = 0 is not completely correct, since it is also true that in context A = 1, p(Di|A = 1, O, C) = 0.05 when O = 0 and Di = 0. In the first case of context A = 1, C = 0, while in the second case, C = 1. Therefore, the value of C does change the probability of travel to Di in context A = 1, so no removal is possible. We conclude that in context A = 0, variables Di and C are independent given variable O. Such a separation is legal since no information is lost, due to the union-product operator discussed above. From the resulting CPDs, we may now make more adequate judgments about the user.

The CPD after refinement is presented in Table III. It shows (i) the original CPD p(Di|A, C, O) of Table II, (ii) the same CPD split into the two partial CPDs p(Di|A = 0, C, O) and p(Di|A = 1, C, O), and (iii) the reduced CPDs obtained by refinement:

A O Di  p(Di|A = 0, O)      A C Di  p(Di|A = 1, C)
0 0 0   0.80                1 0 0   0.15
0 0 1   0.20                1 0 1   0.85
0 1 0   0.10                1 1 0   0.05
0 1 1   0.90                1 1 1   0.95

Table III: Representation of CSIs.

When variable A takes on value 0 (A < 25), travel to Di in February is extremely popular and hotel cost doesn't seem to be an issue. This puzzling conclusion leads Fly4U to investigate further. Since, also in that context, there is a strong dependency between the age group (A) and a relevant holiday at the origin (O), that factor is also taken into consideration for the investigation. This discovery allows Fly4U to enter a smaller reference class for their investigation, which makes the problem easier to resolve. A large number of people aged under 25 are university students. In February, universities have the "Reading Week" break. During Reading Week, many travel on organized trips. Due to group discounts, hotel costs are not a large issue despite the travellers' young age. On the other hand, users aged over 25 are likely no longer students and are probably working. Most places of employment do not have a February break, and therefore variable O, holiday at origin, does not serve that portion of the population. Finally, if people from that group (over 25) were travelling in February regardless, they would be more likely to pay attention to hotel costs.

Correcting the Model

Since there is no longer mention of variable C in context A = 0, we can refine our causal model by removing the direct causal link between C and Di, and similarly in context A = 1 for variable O. With the uncovered hidden contexts of variable A, when considering the probability of travel to destination Di given all factors that have a direct causal link with Di, the initial causal model in Figure 2 can be represented by two more specific causal models that account for particularities of subsets of the data depending on age group. Those refined causal models are illustrated in Figure 5, where the LHS represents the refined model for context A = 0 (under 25), and the RHS represents the refined model for context A = 1 (over 25).

[Figure 5: Causal models after discovery. LHS (context A = 0): the arc from C to Di is removed. RHS (context A = 1): the arc from O to Di is removed.]

From the discovery of CSI and an interpretation of the results permitted by the smaller reference class, we see that the causal models are substantially different when age groups are considered separately. From the study, Fly4U could benefit by offering group discounts to students for February, since they now know that a large portion of the student population will be travelling. No airline company with that kind of insight wants to lose an entire target group of customers to another airline! The techniques presented in this example can generalize to the wider class of uncertain reasoning.

Conclusions and Future Work

Studies frequently reveal that learners learn differently in different contexts, or that marketing strategies perform differently in different contexts. For example, continuing with the theme of travel websites, it seems reasonable to expect that different groups will respond differently to presentation styles, and website designers will want to identify membership in a subgroup before dynamically choosing such a strategy. Given sufficient traffic, it may be possible to build very detailed models and to discover many context-specific strategies for maximizing sales in target groups. The example shown here demonstrates that even with smaller models, it is possible to discover such contexts, and that such contexts lead us to very different conclusions.

References

Boutilier, C.; Friedman, N.; Goldszmidt, M.; and Koller, D. 1996. Context-specific independence in Bayesian networks. In Proc. of the 12th UAI, 115–123.

Butz, C., and Sanscartier, M. 2002. A method for detecting context-specific independence in conditional probability tables. In Proc. of the 3rd Int'l RSCTC, 344–348.

Good, I. 1983. A causal calculus. British Journal for Philosophy of Science 11.

Pearl, J., and Verma, T. 1991. A theory of inferred causation. In Principles of Knowledge Representation and Reasoning, 441–452. Morgan Kaufmann.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, USA: Morgan Kaufmann.

Reichenbach, H. 1956. The Direction of Time. Berkeley: University of California Press.
Sanscartier, M., and Neufeld, E. 2006. Discovering hidden dispositions and situational factors in causal relations by means of contextual independencies. In Proc. of the 28th Conference of the Cognitive Science Society.

Suppes, P. 1970. A Probabilistic Theory of Causation. Amsterdam: North Holland.

Zhang, N., and Poole, D. 1999. On the role of context-specific independence in probabilistic reasoning. In Proc. of the 16th IJCAI, 1288–1293.